Large language models (LLMs) consist of many components, but not all of them are essential to the model's output. Such non-critical components can be pruned, reducing model size while largely maintaining performance.
Unstructured Pruning:
- Removes individual parameters without considering the model's internal structure.
- Sets insignificant parameters to zero, creating a sparse model.
- Easy to implement, but the resulting irregular sparsity pattern is hard for hardware to exploit.
- Requires additional processing to compress the sparse model and might need retraining to recover accuracy.
- Notable advancements include SparseGPT (eliminates retraining) and LoRAPrune (combines low-rank adaptation with pruning).
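The core idea behind magnitude-based unstructured pruning can be sketched in a few lines. This is a minimal illustration, not how SparseGPT or LoRAPrune actually operate: it simply zeroes the smallest-magnitude weights in a tensor.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) < threshold, 0.0, weights)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
pruned = magnitude_prune(w, sparsity=0.5)
print(f"fraction zeroed: {np.mean(pruned == 0):.2f}")
```

Note that the tensor keeps its original shape: the zeros are scattered wherever the small weights happened to be, which is exactly the irregular pattern that makes unstructured sparsity hard to accelerate without specialized kernels or storage formats.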
Structured Pruning:
- Removes whole structural components, such as neurons, attention heads, or layers.
- Simplifies model compression and boosts hardware efficiency.
- Requires a deep understanding of the model and might significantly impact accuracy.
- LLM-Pruner is a promising technique that uses gradient information to prune without relying heavily on original training data.
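To make the contrast concrete, here is a minimal sketch of structured pruning for a single linear layer: instead of zeroing scattered weights, it drops entire output neurons (rows of the weight matrix) with the smallest L2 norm, so the layer itself becomes physically smaller. This is a toy norm-based criterion, not LLM-Pruner's gradient-based one.

```python
import numpy as np

def prune_neurons(W: np.ndarray, b: np.ndarray, keep: int):
    """Drop the output neurons (rows of W) with the smallest L2 norm."""
    norms = np.linalg.norm(W, axis=1)
    keep_idx = np.sort(np.argsort(norms)[-keep:])  # strongest neurons, original order
    return W[keep_idx], b[keep_idx], keep_idx

rng = np.random.default_rng(1)
W, b = rng.normal(size=(8, 16)), rng.normal(size=8)
W_small, b_small, kept = prune_neurons(W, b, keep=4)
print(W_small.shape)  # the layer shrinks from (8, 16) to (4, 16)
```

Because whole rows disappear, the compressed layer runs as ordinary dense matrix multiplication on any hardware; the cost is that the next layer's input dimension must shrink to match, which is why structured pruning requires understanding the model's architecture.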
- Both methods aim to optimize the balance between model size and performance.
LLM Knowledge Distillation
Knowledge distillation involves training a smaller "student" model to emulate a more complex "teacher" model, effectively creating a compact yet proficient model. In the context of LLMs, this technique has two primary categories:
Standard Knowledge Distillation:
- Transfers the broad knowledge of the teacher to the student.
- Can use prompts and responses from models like ChatGPT to train smaller LLMs, though there are constraints related to data from commercial models.
- MiniLLM, developed by Tsinghua University and Microsoft Research, improves the process by using specialized objective and optimization functions, addressing the challenge of accurately capturing data distributions.
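The classic distillation objective is a KL divergence between temperature-softened teacher and student output distributions. The sketch below shows this standard forward-KL loss for a single token's logits; MiniLLM's contribution is precisely to replace this style of objective with a different one better suited to generative LLMs.

```python
import numpy as np

def softmax(logits: np.ndarray, T: float = 1.0) -> np.ndarray:
    """Temperature-scaled softmax; higher T produces a softer distribution."""
    z = logits / T
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T: float = 2.0) -> float:
    """KL(teacher || student) over softened distributions, scaled by T^2."""
    p_t = softmax(teacher_logits, T)
    log_p_s = np.log(softmax(student_logits, T))
    return float(np.sum(p_t * (np.log(p_t) - log_p_s)) * T**2)

teacher = np.array([2.0, 1.0, 0.1])
student = np.array([1.5, 1.2, 0.3])
print(f"distillation loss: {distillation_loss(student, teacher):.4f}")
```

The loss is zero when the student exactly matches the teacher's distribution and grows as they diverge; during training it is typically mixed with the ordinary cross-entropy loss on ground-truth labels.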
Emergent Ability Distillation:
- Targets the transfer of a specific capability from the teacher model.
- Examples include extracting math or reasoning skills from GPT-4 to a smaller model, such as Vicuna.
- Focusing on a narrower task set makes EA distillation easier to evaluate, but it is essential to recognize the limits of transferring emergent behaviors to smaller LLMs.
LLM Quantization
LLMs store their parameters as floating-point values, so a model the size of GPT-3 occupies hundreds of gigabytes of memory. Quantization reduces this footprint by converting parameters to lower-precision integers.
Benefits of Quantization:
- Allows LLMs to run on everyday devices.
- Examples of quantized LLMs include GPT4All and Llama.cpp.
Quantization Approaches:
- Quantization-Aware Training (QAT): Integrates quantization into training, allowing the model to learn low-precision representations. The downside is that it requires training from scratch.
- Quantization-Aware Fine-Tuning (QAFT): Adapts a pre-trained model for lower-precision weights. Techniques like QLoRA and PEQA are used in this approach.
- Post-Training Quantization (PTQ): Reduces precision after training without changing the architecture. It's simple and efficient but might affect accuracy.
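The simplest of the three, PTQ, can be illustrated with symmetric int8 quantization: each float32 weight is mapped to an 8-bit integer via a single per-tensor scale factor, shrinking storage by 4x. This is a bare-bones sketch; production schemes (e.g. those in GPT4All or Llama.cpp) use per-group scales and more sophisticated formats.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric post-training quantization to int8 with one per-tensor scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(2)
w = rng.normal(size=1024).astype(np.float32)
q, scale = quantize_int8(w)
error = np.abs(dequantize(q, scale) - w).max()
print(f"float32 -> int8 (4x smaller), max round-trip error: {error:.4f}")
```

The round-trip error is bounded by half the scale factor per weight; the accuracy hit mentioned above comes from these rounding errors accumulating across billions of parameters, which is why more careful PTQ methods calibrate scales on sample data.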
For an in-depth exploration of LLM compression, the paper "A Survey on Model Compression for Large Language Models" is recommended.