AILAB Blog: Knowledge distillation

10.26.2024

Optimizing Sub-Billion Scale Models for On-Device Applications: The MobileLLM Approach

Introduction

The proliferation of large language models (LLMs) has revolutionized numerous aspects of human interaction with technology. These models, often comprising billions of parameters, have demonstrated remarkable capabilities in understanding and generating human language. However, their deployment is often constrained by the substantial computational resources they demand, making them less suitable for on-device applications where memory and processing power are limited. This blog post explores the MobileLLM project, which aims to optimize sub-billion scale models for efficient on-device performance without compromising accuracy.

Improving Sub-Billion Scale LLM Design

In the quest to enhance the performance of sub-billion scale LLMs, the MobileLLM project undertakes a comprehensive design evolution. Starting from baseline models with 125M and 350M parameters, the project explores several model design techniques that are particularly beneficial for these smaller models:

Adopting SwiGLU FFN: The use of SwiGLU (Switchable Gated Linear Units) in the feed-forward network (FFN) has shown to improve model accuracy.
Forcing Lanky Architectures: Focusing on deep and thin architectures, which prioritize model depth over width, leads to better parameter utilization.
Embedding Sharing Methods: Techniques like input and output embedding sharing help reduce the parameter count without significant accuracy loss.
Grouped Query Attention: This method enhances attention mechanisms within the model, improving its overall performance.

These techniques collectively form a robust baseline model named MobileLLM. Further improvements are achieved through an immediate block-wise layer-sharing method, which enhances accuracy without additional memory overhead.

Training and Evaluation

The training of MobileLLM models was conducted on 32 A100 GPUs, using both exploratory and extensive training iterations. Initial exploratory experiments involved 120,000 iterations on 0.25 trillion tokens, which helped identify the most promising model configurations. These top models were subsequently trained using 480,000 iterations on 1 trillion tokens to fully leverage their potential.

The evaluation of the MobileLLM models was comprehensive, covering a range of zero-shot commonsense reasoning tasks, question answering, and reading comprehension benchmarks. For zero-shot commonsense reasoning, the models were tested on datasets such as ARC-easy and ARC-challenge (AI2 Reasoning Challenge), BoolQ (Boolean Questions), PIQA (Physical Interaction: Question Answering), SIQA (Social Interaction Question Answering), HellaSwag, OBQA (OpenBook Question Answering), and WinoGrande. These datasets collectively assess the model’s ability to handle a variety of reasoning scenarios, from basic factual questions to complex situational judgments.

Compatibility with Quantization

An essential aspect of optimizing LLMs for on-device use is ensuring compatibility with quantization techniques. The MobileLLM project tested per-token min-max post-training quantization (PTQ) on both 125M and 350M models. The results indicated only a modest accuracy reduction, confirming that these models could maintain high performance even when subjected to 8-bit weight and activation quantization.

Knowledge Distillation

To further enhance model efficiency, the project explored Knowledge Distillation (KD) techniques by utilizing larger models like LLaMA-v2 7B as teachers. KD involves transferring the knowledge from a larger, pre-trained teacher model to a smaller student model, thereby aiming to retain the accuracy and capabilities of the larger model while benefiting from the compactness of the smaller one. In this study, the KD loss was computed using the cross-entropy between the logits of the teacher and student models.

While implementing KD, the project team encountered significant training time overheads. Specifically, the training process experienced a slowdown by a factor of 2.6 to 3.2 times compared to traditional label-based training methods. Despite this increase in training time, the accuracy gains achieved through KD were comparable to those obtained via label-based training. This suggests that KD is a viable approach for training compact models, balancing the trade-off between training efficiency and model performance. The detailed results, as illustrated in Table 16 of the document, highlight the effectiveness of KD in maintaining high accuracy while reducing the model size, making it a promising technique for developing efficient, small-scale language models

On-Device Profiling

The true test of MobileLLM’s design came through on-device profiling. Using an iPhone 13, the project measured latency for loading, initialization, and execution of MobileLLM models. The findings showed that through effective weight-sharing and optimized layer structures, the models achieved minimal increases in latency, making them highly suitable for on-device applications.

Discussion

The advancements demonstrated by the MobileLLM project underline the potential for deploying efficient LLMs in memory-constrained environments. By meticulously optimizing model architecture and training techniques, MobileLLM achieves significant performance improvements without requiring the extensive computational resources typical of larger models. This work not only contributes to the field of LLM optimization but also paves the way for more accessible and energy-efficient AI applications across various devices.

Conclusion

The MobileLLM project represents a significant step forward in optimizing sub-billion scale models for on-device applications. Through innovative design choices and rigorous testing, these models have shown substantial improvements in various benchmarks, including zero-shot commonsense reasoning and API calling tasks. As the demand for efficient, powerful, and accessible AI continues to grow, the principles and techniques developed in this project will undoubtedly play a crucial role in the future of AI deployment.

9.21.2023

LLM compression

LLM pruning

Large language models (LLMs) consist of many components, but not all are essential for output. Such non-critical components can be pruned to maintain performance while reducing model size.

Unstructured Pruning:

Involves removing parameters without considering the model's structure.
Sets insignificant parameters to zero, creating a sparse model.
It is easy to implement but hard to optimize due to its random weight distribution.
Requires additional processing to compress and might need retraining.
Notable advancements include SparseGPT (eliminates retraining) and LoRAPrune (combines low-rank adaptation with pruning).

Structured Pruning:

Removes whole sections, like neurons or layers.
Simplifies model compression and boosts hardware efficiency.
Requires a deep understanding of the model and might significantly impact accuracy.
LLM-Pruner is a promising technique that uses gradient information to prune without relying heavily on original training data.
Both methods aim to optimize the balance between model size and performance.

LLM Knowledge Distillation

Knowledge distillation involves training a smaller "student" model to emulate a more complex "teacher" model, effectively creating a compact yet proficient model. In the context of LLMs, this technique has two primary categories:

Standard Knowledge Distillation:

Transfers the broad knowledge of the teacher to the student.
Can use prompts and responses from models like ChatGPT to train smaller LLMs, though there are constraints related to data from commercial models.
MiniLLM, developed by Tsinghua University and Microsoft Research, improves the process by using specialized objective and optimization functions, addressing the challenge of accurately capturing data distributions.

Emergent Ability Distillation:

Targets the transfer of a specific capability from the teacher model.
Examples include extracting math or reasoning skills from GPT-4 to a smaller model, such as Vicuna.
Focusing on a narrower task set makes measuring EA distillation easier, but it's essential to recognize the limitations in transferring emergent behaviors to smaller LLMs

LLM Quantization

Large Language Models (LLMs) like GPT-3 store parameters as floating-point values, with models like GPT-3 using hundreds of gigabytes of memory. To reduce this size, a technique called quantization is used, converting parameters to smaller integers.

Benefits of Quantization:

Allows LLMs to run on everyday devices.
Examples of quantized LLMs include GPT4All and Llama.cpp.

Quantization Approaches:

Quantization-Aware Training (QAT): Integrates quantization during training, allowing models to learn low-precision representations. The downside is it requires training from the beginning.
Quantization-Aware Fine-Tuning (QAFT): Adapts a pre-trained model for lower-precision weights. Techniques like QLoRA and PEQA are used in this approach.
Post-Training Quantization (PTQ): Reduces precision after training without changing the architecture. It's simple and efficient but might affect accuracy.

For an in-depth exploration of LLM compression, the paper "A Survey on Model Compression for Large Language Models" is recommended.