AILAB Blog: LLM compression

LLM pruning

Large language models (LLMs) consist of many components, but not all are essential for output. Such non-critical components can be pruned to maintain performance while reducing model size.

Unstructured Pruning:

Involves removing parameters without considering the model's structure.
Sets insignificant parameters to zero, creating a sparse model.
It is easy to implement but hard to optimize due to its random weight distribution.
Requires additional processing to compress and might need retraining.
Notable advancements include SparseGPT (eliminates retraining) and LoRAPrune (combines low-rank adaptation with pruning).

Structured Pruning:

Removes whole sections, like neurons or layers.
Simplifies model compression and boosts hardware efficiency.
Requires a deep understanding of the model and might significantly impact accuracy.
LLM-Pruner is a promising technique that uses gradient information to prune without relying heavily on original training data.
Both methods aim to optimize the balance between model size and performance.

LLM Knowledge Distillation

Knowledge distillation involves training a smaller "student" model to emulate a more complex "teacher" model, effectively creating a compact yet proficient model. In the context of LLMs, this technique has two primary categories:

Standard Knowledge Distillation:

Transfers the broad knowledge of the teacher to the student.
Can use prompts and responses from models like ChatGPT to train smaller LLMs, though there are constraints related to data from commercial models.
MiniLLM, developed by Tsinghua University and Microsoft Research, improves the process by using specialized objective and optimization functions, addressing the challenge of accurately capturing data distributions.

Emergent Ability Distillation:

Targets the transfer of a specific capability from the teacher model.
Examples include extracting math or reasoning skills from GPT-4 to a smaller model, such as Vicuna.
Focusing on a narrower task set makes measuring EA distillation easier, but it's essential to recognize the limitations in transferring emergent behaviors to smaller LLMs

LLM Quantization

Large Language Models (LLMs) like GPT-3 store parameters as floating-point values, with models like GPT-3 using hundreds of gigabytes of memory. To reduce this size, a technique called quantization is used, converting parameters to smaller integers.

Benefits of Quantization:

Allows LLMs to run on everyday devices.
Examples of quantized LLMs include GPT4All and Llama.cpp.

Quantization Approaches:

Quantization-Aware Training (QAT): Integrates quantization during training, allowing models to learn low-precision representations. The downside is it requires training from the beginning.
Quantization-Aware Fine-Tuning (QAFT): Adapts a pre-trained model for lower-precision weights. Techniques like QLoRA and PEQA are used in this approach.
Post-Training Quantization (PTQ): Reduces precision after training without changing the architecture. It's simple and efficient but might affect accuracy.

For an in-depth exploration of LLM compression, the paper "A Survey on Model Compression for Large Language Models" is recommended.

AILAB Blog

9.21.2023

LLM compression

No comments:

Post a Comment