LLM compression

LLM pruning

Large language models (LLMs) consist of many components, but not all of them contribute equally to the output. Such non-critical components can be pruned, reducing model size while largely preserving performance.

Unstructured Pruning:
  • Involves removing parameters without considering the model's structure.
  • Sets insignificant parameters to zero, creating a sparse model.
  • It is easy to implement but hard to accelerate in practice, because the irregular sparsity pattern does not map well onto standard hardware.
  • Requires additional processing to compress and might need retraining.
  • Notable advancements include SparseGPT (eliminates retraining) and LoRAPrune (combines low-rank adaptation with pruning).
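As a minimal illustration of unstructured pruning, the sketch below zeroes out the smallest-magnitude weights of a tensor until a target sparsity is reached. The function name `magnitude_prune` is illustrative, and magnitude-based scoring is just one simple importance criterion (methods like SparseGPT use more sophisticated, retraining-free criteria):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights until `sparsity`
    fraction of the tensor is zero. The tensor keeps its shape;
    only individual entries are removed (set to 0)."""
    flat = np.abs(weights).ravel()
    k = int(len(flat) * sparsity)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    pruned = weights.copy()
    pruned[np.abs(weights) <= threshold] = 0.0
    return pruned

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
pruned = magnitude_prune(w, sparsity=0.5)
print(np.mean(pruned == 0))  # fraction of weights that are now zero
```

Note that the result is the same shape as the input: the "savings" only materialize once the sparse tensor is stored or executed in a compressed format, which is exactly the optimization difficulty mentioned above.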

Structured Pruning:
  • Removes whole sections, like neurons or layers.
  • Simplifies model compression and boosts hardware efficiency.
  • Requires a deep understanding of the model and might significantly impact accuracy.
  • LLM-Pruner is a promising technique that uses gradient information to prune without relying heavily on original training data.

Both approaches aim to strike the best balance between model size and performance.
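A minimal sketch of the structural idea, removing whole output neurons (rows of a weight matrix) rather than individual entries. The row-norm importance score and the name `prune_neurons` are illustrative assumptions; LLM-Pruner uses gradient information instead of simple norms:

```python
import numpy as np

def prune_neurons(W: np.ndarray, b: np.ndarray, keep: int):
    """Structured pruning of a dense layer: drop the output neurons
    (rows of W, entries of b) with the smallest L2 norms, so the
    layer becomes genuinely smaller rather than merely sparse."""
    norms = np.linalg.norm(W, axis=1)              # importance score per neuron
    keep_idx = np.sort(np.argsort(norms)[-keep:])  # retain the top-`keep` neurons
    return W[keep_idx], b[keep_idx], keep_idx

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 16))  # 8 output neurons, 16 inputs
b = rng.normal(size=8)
W_small, b_small, kept = prune_neurons(W, b, keep=4)
print(W_small.shape)  # (4, 16): a smaller, dense layer
```

Because the output is a smaller dense matrix, it runs efficiently on ordinary hardware with no special sparse kernels, which is the hardware-efficiency advantage noted above. The cost is that downstream layers consuming these outputs must be adjusted to match the reduced dimension.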

LLM Knowledge Distillation

Knowledge distillation involves training a smaller "student" model to emulate a more complex "teacher" model, effectively creating a compact yet proficient model. In the context of LLMs, this technique has two primary categories:

Standard Knowledge Distillation:

  • Transfers the broad knowledge of the teacher to the student.
  • Can use prompts and responses from models like ChatGPT to train smaller LLMs, though the terms of service of commercial models constrain how their outputs may be used.
  • MiniLLM, developed by Tsinghua University and Microsoft Research, improves the process by using specialized objective and optimization functions, addressing the challenge of accurately capturing data distributions.
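The core of standard distillation is a loss that pushes the student's output distribution toward the teacher's. Below is a sketch of the classic soft-target loss (temperature-softened KL divergence); MiniLLM's contribution is precisely to replace this standard objective with one better suited to generative LLMs. Function names here are illustrative:

```python
import numpy as np

def softmax(z: np.ndarray, T: float = 1.0) -> np.ndarray:
    """Temperature-scaled softmax; higher T produces softer distributions."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits: np.ndarray,
                      teacher_logits: np.ndarray,
                      T: float = 2.0) -> float:
    """KL(teacher || student) over temperature-softened distributions.
    The T*T factor keeps gradient magnitudes comparable across temperatures."""
    p = softmax(teacher_logits, T)  # soft targets from the teacher
    q = softmax(student_logits, T)  # student's current predictions
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    return float(np.mean(kl) * T * T)

teacher = np.array([[4.0, 1.0, 0.5]])
student = np.array([[3.5, 1.2, 0.4]])
print(distillation_loss(student, teacher))  # small positive value
```

In practice this term is minimized alongside the ordinary cross-entropy loss on ground-truth labels, so the student learns both from the data and from the teacher's "dark knowledge" about relative class similarities.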

Emergent Ability Distillation:

  • Targets the transfer of a specific capability from the teacher model.
  • Examples include extracting math or reasoning skills from GPT-4 to a smaller model, such as Vicuna.
  • Focusing on a narrower task set makes emergent ability (EA) distillation easier to evaluate, but it is essential to recognize the limits of transferring emergent behaviors to smaller LLMs.

LLM Quantization

Large language models such as GPT-3 store their parameters as floating-point values, and at that scale the weights alone occupy hundreds of gigabytes of memory. Quantization reduces this footprint by converting the parameters to lower-precision integers.

Benefits of Quantization:

  • Allows LLMs to run on everyday devices.
  • Projects such as GPT4All and llama.cpp run quantized LLMs on consumer hardware.

Quantization Approaches:

  • Quantization-Aware Training (QAT): Integrates quantization into training so the model learns low-precision representations from the start. The downside is that it requires training from scratch, which is expensive for large models.
  • Quantization-Aware Fine-Tuning (QAFT): Adapts a pre-trained model for lower-precision weights. Techniques like QLoRA and PEQA are used in this approach.
  • Post-Training Quantization (PTQ): Reduces precision after training without changing the architecture. It's simple and efficient but might affect accuracy.
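To make the float-to-integer conversion concrete, here is a sketch of symmetric per-tensor int8 post-training quantization, one of the simplest PTQ schemes. Each float weight is mapped to an 8-bit integer plus a single shared scale factor; real PTQ systems typically use finer-grained (per-channel or per-group) scales:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric PTQ to int8: store integers in [-127, 127] plus one
    float scale per tensor. Memory drops ~4x versus float32."""
    scale = max(float(np.abs(w).max()), 1e-12) / 127.0  # guard against all-zero w
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for computation."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(2)
w = rng.normal(size=(256,)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
print(np.max(np.abs(w - w_hat)))  # reconstruction error, bounded by ~scale/2
```

The accuracy impact mentioned above comes from exactly this rounding error: each weight can shift by up to half a quantization step, and outlier weights inflate the scale, coarsening the grid for everything else.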

For an in-depth exploration of LLM compression, the paper "A Survey on Model Compression for Large Language Models" is recommended.
