
10.26.2024

Optimizing Sub-Billion Scale Models for On-Device Applications: The MobileLLM Approach

MobileLLM

Introduction

The proliferation of large language models (LLMs) has revolutionized numerous aspects of human interaction with technology. These models, often comprising billions of parameters, have demonstrated remarkable capabilities in understanding and generating human language. However, their deployment is often constrained by the substantial computational resources they demand, making them less suitable for on-device applications where memory and processing power are limited. This blog post explores the MobileLLM project, which aims to optimize sub-billion scale models for efficient on-device performance without compromising accuracy.


Improving Sub-Billion Scale LLM Design

In the quest to enhance the performance of sub-billion scale LLMs, the MobileLLM project undertakes a comprehensive design evolution. Starting from baseline models with 125M and 350M parameters, the project explores several model design techniques that are particularly beneficial for these smaller models:

  1. Adopting SwiGLU FFN: Replacing the standard feed-forward network (FFN) with SwiGLU (the Swish-gated linear unit) has been shown to improve model accuracy (a minimal sketch follows this list).
  2. Forcing Lanky Architectures: Prioritizing depth over width ("deep and thin" networks) makes better use of a limited parameter budget.
  3. Embedding Sharing Methods: Techniques like input and output embedding sharing help reduce the parameter count without significant accuracy loss.
  4. Grouped Query Attention: Sharing key-value heads across groups of query heads reduces attention parameters and KV-cache size with little loss in accuracy.
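
As a minimal sketch (dimensions and naming here are illustrative, not MobileLLM's actual configuration), a SwiGLU feed-forward block in PyTorch looks like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Feed-forward block using SwiGLU: SiLU(x W_gate) * (x W_up), then W_down."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU (a.k.a. Swish) gates the parallel linear projection element-wise.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Example: a narrow, "deep and thin" configuration for a small model.
ffn = SwiGLUFFN(dim=576, hidden_dim=1536)
out = ffn(torch.randn(2, 16, 576))  # (batch, sequence, dim)
```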

These techniques collectively form a strong baseline model named MobileLLM. Further improvements come from an immediate block-wise layer-sharing method, which reuses each block's weights to increase effective depth, improving accuracy without additional memory overhead.
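
The layer-sharing idea can be sketched in a few lines of PyTorch. This is a conceptual illustration only (the block structure and naming are ours, not the authors' code): each stored block is executed twice in a row, so effective depth grows while the weights, and hence the memory footprint, stay the same.

```python
import torch
import torch.nn as nn

class SharedDepthStack(nn.Module):
    """Immediate block-wise sharing: each stored block is applied twice in a row,
    doubling effective depth without storing any extra weights."""
    def __init__(self, blocks: nn.ModuleList, repeats: int = 2):
        super().__init__()
        self.blocks = blocks      # weights stored once
        self.repeats = repeats    # how many times each block is reused back-to-back

    def forward(self, x):
        for block in self.blocks:
            for _ in range(self.repeats):
                x = block(x)      # same weights applied immediately again
        return x

# Toy demo: 12 stored blocks behave like a 24-layer model.
blocks = nn.ModuleList(nn.Sequential(nn.Linear(576, 576), nn.GELU()) for _ in range(12))
model = SharedDepthStack(blocks)
out = model(torch.randn(2, 16, 576))
```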


Training and Evaluation

The MobileLLM models were trained on 32 A100 GPUs in two stages. Initial exploratory experiments ran for 120,000 iterations on 0.25 trillion tokens and helped identify the most promising model configurations; the top configurations were then trained for 480,000 iterations on 1 trillion tokens to fully realize their potential.

The evaluation of the MobileLLM models was comprehensive, covering a range of zero-shot commonsense reasoning tasks, question answering, and reading comprehension benchmarks. For zero-shot commonsense reasoning, the models were tested on datasets such as ARC-easy and ARC-challenge (AI2 Reasoning Challenge), BoolQ (Boolean Questions), PIQA (Physical Interaction: Question Answering), SIQA (Social Interaction Question Answering), HellaSwag, OBQA (OpenBook Question Answering), and WinoGrande. These datasets collectively assess the model’s ability to handle a variety of reasoning scenarios, from basic factual questions to complex situational judgments.


Compatibility with Quantization

An essential aspect of optimizing LLMs for on-device use is ensuring compatibility with quantization techniques. The MobileLLM project tested per-token min-max post-training quantization (PTQ) on both 125M and 350M models. The results indicated only a modest accuracy reduction, confirming that these models could maintain high performance even when subjected to 8-bit weight and activation quantization.
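
For intuition, here is a minimal sketch of per-token min-max quantization to 8-bit integers. It illustrates the general recipe (one scale and zero-point per token row); it is not the paper's actual tooling.

```python
import torch

def quantize_per_token_minmax(x: torch.Tensor, n_bits: int = 8):
    """Asymmetric min-max quantization with one scale/zero-point per token (row)."""
    qmin, qmax = 0, 2 ** n_bits - 1
    x_min = x.min(dim=-1, keepdim=True).values
    x_max = x.max(dim=-1, keepdim=True).values
    scale = (x_max - x_min).clamp(min=1e-8) / (qmax - qmin)
    zero_point = torch.round(-x_min / scale).clamp(qmin, qmax)
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return q.to(torch.uint8), scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.float() - zero_point) * scale

# Round-trip example on a fake activation tensor (tokens x hidden_dim).
x = torch.randn(4, 768)
q, s, z = quantize_per_token_minmax(x)
x_hat = dequantize(q, s, z)
print((x - x_hat).abs().max())  # small reconstruction error
```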


Knowledge Distillation

To further enhance model efficiency, the project explored Knowledge Distillation (KD) techniques by utilizing larger models like LLaMA-v2 7B as teachers. KD involves transferring the knowledge from a larger, pre-trained teacher model to a smaller student model, thereby aiming to retain the accuracy and capabilities of the larger model while benefiting from the compactness of the smaller one. In this study, the KD loss was computed using the cross-entropy between the logits of the teacher and student models.
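
A minimal sketch of that loss is shown below; the temperature parameter is our addition for generality and is not taken from the paper.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            temperature: float = 1.0) -> torch.Tensor:
    """Cross-entropy between the teacher's soft distribution and the student's
    predictions. Logits have shape (batch, seq_len, vocab)."""
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # Soft cross-entropy, averaged over batch and sequence positions.
    return -(t_probs * s_log_probs).sum(dim=-1).mean()

# Usage: the teacher (e.g., LLaMA-v2 7B) runs in inference mode, the student trains.
# loss = kd_loss(student(input_ids), teacher(input_ids).detach())
```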

While implementing KD, the project team encountered significant training-time overhead: training slowed down by a factor of 2.6 to 3.2 compared with conventional label-based training. Despite this increase in training time, the accuracy gains achieved through KD were comparable to those obtained via label-based training. This suggests that KD is a viable approach for training compact models, balancing the trade-off between training efficiency and model performance. The detailed results, reported in Table 16 of the paper, highlight the effectiveness of KD in maintaining high accuracy at a reduced model size, making it a promising technique for developing efficient, small-scale language models.


On-Device Profiling

The true test of MobileLLM’s design came through on-device profiling. Using an iPhone 13, the project measured latency for loading, initialization, and execution of MobileLLM models. The findings showed that through effective weight-sharing and optimized layer structures, the models achieved minimal increases in latency, making them highly suitable for on-device applications.


Discussion

The advancements demonstrated by the MobileLLM project underline the potential for deploying efficient LLMs in memory-constrained environments. By meticulously optimizing model architecture and training techniques, MobileLLM achieves significant performance improvements without requiring the extensive computational resources typical of larger models. This work not only contributes to the field of LLM optimization but also paves the way for more accessible and energy-efficient AI applications across various devices.


Conclusion

The MobileLLM project represents a significant step forward in optimizing sub-billion scale models for on-device applications. Through innovative design choices and rigorous testing, these models have shown substantial improvements in various benchmarks, including zero-shot commonsense reasoning and API calling tasks. As the demand for efficient, powerful, and accessible AI continues to grow, the principles and techniques developed in this project will undoubtedly play a crucial role in the future of AI deployment.

5.21.2024

Exploring GGUF and GGML

GGUF, GGML

In the ever-evolving world of technology, especially within the domain of Large Language Models (LLMs), efficiency and performance optimization are key. The recent introduction of GGUF, standing for "GPT-Generated Unified Format," marks a significant advancement in the way we interact with and deploy LLMs. This breakthrough, pioneered by the llama.cpp team, has set a new standard for quantized models, rendering its predecessor, GGML, a stepping stone in the journey toward more accessible and efficient model formats.


The Evolution from GGML to GGUF

Originally, GGML (a tensor library written in C) was designed to facilitate running LLMs on various computational platforms, using the CPU alone or in combination with a GPU. On August 21, 2023, however, the llama.cpp team introduced GGUF as its successor. GGUF not only retains the ability to run models on a CPU and offload certain layers to a GPU for enhanced performance, but also introduces several new features.

One of the key innovations of GGUF is its unified file format, which integrates all necessary metadata directly into the model file. This development simplifies the deployment and operation of LLMs by eliminating the need for additional files, such as tokenizer_config.json, that were previously required. Moreover, llama.cpp has developed a tool to convert .safetensors model files into the .gguf format, further facilitating the transition to this more efficient system.


Compatibility and Performance

GGUF's design is not only about efficiency but also about compatibility and future-proofing. Its architecture allows LLMs to run on CPUs, GPUs, and Apple-silicon hardware, with multi-threaded inference for improved performance. The format is also designed to be extensible, so future enhancements and features can be integrated without breaking compatibility with existing models.
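
To make the CPU/GPU split concrete, here is how a quantized GGUF model can be loaded with the community llama-cpp-python bindings; the file name and parameter values are placeholders chosen for illustration.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="model-q4_k_m.gguf",  # placeholder path to a quantized GGUF file
    n_ctx=4096,        # context window
    n_threads=8,       # CPU threads for multi-threaded inference
    n_gpu_layers=20,   # layers offloaded to the GPU (0 = CPU only)
)

out = llm("Explain GGUF in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```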


Quantization: A Comparative Overview

While GGUF/GGML and GPTQ might seem similar at first glance, it's crucial to understand their differences. GPTQ employs a post-training quantization method to compress LLMs, significantly reducing the memory footprint of models like GPT by approximating weights layer by layer. This approach differs fundamentally from GGUF/GGML's method, which focuses on operational efficiency and flexibility in deployment scenarios.


Looking Ahead

The transition from GGML to GGUF is not merely a technical update but a reflection of the continuous pursuit of optimization in the field of artificial intelligence. By centralizing metadata and enhancing compatibility and performance, GGUF sets a new benchmark for future developments in LLM deployment and utilization.

As the landscape of LLMs continues to grow, the importance of formats like GGUF will only increase. Their ability to make powerful models more accessible and efficient will play a crucial role in democratizing the benefits of artificial intelligence, opening new avenues for innovation and application across various sectors.

2.11.2024

Large Language Model Course

The "Large Language Model (LLM) Course" on GitHub by Maxime Labonne is a treasure trove for anyone interested in diving deep into the world of LLMs. This meticulously crafted course is designed to guide learners through the essentials of Large Language Models, leveraging Colab notebooks and detailed roadmaps to provide a hands-on learning experience. Here's a glimpse of what the course offers:


  • LLM Fundamentals: The course begins with the basics, covering crucial mathematical concepts, Python programming, and the foundations of neural networks. It ensures that learners have the necessary groundwork to delve deeper into the subject.
  • The LLM Scientist and Engineer: The curriculum is cleverly divided into two tracks – one for those aiming to master the science behind building state-of-the-art LLMs and another for those interested in engineering LLM-based applications and solutions.
  • Hands-on Learning: With a rich collection of notebooks, the course provides practical experience in fine-tuning, quantization, and deploying LLMs. From fine-tuning Llama 2 in Google Colab to exploring quantization techniques for optimizing model performance, learners can get their hands dirty with real-world applications.
  • Comprehensive Coverage: Topics range from the very basics of machine learning and Python to advanced areas like neural network training, natural language processing (NLP), and beyond. The course also dives into specific LLM applications, offering insights into decoding strategies, model quantization, and even how to enhance ChatGPT with knowledge graphs.
  • Accessible and User-Friendly: Designed with the learner in mind, the course materials are accessible to both beginners and advanced users, with Colab notebooks simplifying the execution of complex code and experiments.

This course stands out as a comprehensive guide for anyone looking to explore the expansive realm of LLMs, from academic enthusiasts to industry professionals. Whether you're aiming to understand the theoretical underpinnings or seeking to apply LLMs in practical scenarios, this course offers the resources and guidance needed to embark on or advance your journey in the field of artificial intelligence.

For more details, visit the LLM Course on GitHub.

10.02.2023

EfficientML.ai Lecture, Fall 2023, MIT 6.5940


Large generative models (e.g., large language models, diffusion models) have shown remarkable performance, but their enormous scale demands significant computation and memory resources. To make them more accessible, it is crucial to improve their efficiency. This course introduces efficient deep learning computing techniques that enable powerful deep learning applications on resource-constrained devices. Topics include model compression, pruning, quantization, neural architecture search, distributed training, data/model parallelism, gradient compression, and on-device fine-tuning. It also introduces application-specific acceleration techniques for large language models, diffusion models, video recognition, and point clouds, and covers topics in quantum machine learning. Students will get hands-on experience deploying large language models (e.g., LLaMA 2) on a laptop.

9.21.2023

LLM compression

LLM pruning

Large language models (LLMs) consist of many components, but not all of them are essential for producing output. Such non-critical components can be pruned to reduce model size while maintaining performance.

Unstructured Pruning:
  • Involves removing parameters without considering the model's structure.
  • Sets insignificant parameters to zero, creating a sparse model.
  • It is easy to implement but hard to accelerate in practice, because the resulting sparsity pattern is irregular (see the sketch after this list).
  • Requires additional processing to compress and might need retraining.
  • Notable advancements include SparseGPT (eliminates retraining) and LoRAPrune (combines low-rank adaptation with pruning).
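
As a minimal illustration (using PyTorch's built-in pruning utilities rather than SparseGPT or LoRAPrune), magnitude-based unstructured pruning simply zeroes the smallest weights:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(1024, 1024)

# Zero out the 30% of weights with the smallest absolute value.
prune.l1_unstructured(layer, name="weight", amount=0.3)

# The mask is applied on the fly; make the zeros permanent:
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.1%}")  # ~30%, scattered irregularly across the matrix
```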

Structured Pruning:
  • Removes whole sections of the network, such as neurons or layers (see the sketch after this list).
  • Simplifies model compression and boosts hardware efficiency.
  • Requires a deep understanding of the model and might significantly impact accuracy.
  • LLM-Pruner is a promising technique that uses gradient information to prune without relying heavily on original training data.

Both approaches aim to optimize the balance between model size and performance.
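
For contrast, here is a minimal structured-pruning sketch that removes whole output neurons of a linear layer, again with generic PyTorch utilities rather than LLM-Pruner itself:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(1024, 1024)

# Remove 25% of output neurons: entire rows of the weight matrix, selected
# by their L2 norm, are zeroed out together.
prune.ln_structured(layer, name="weight", amount=0.25, n=2, dim=0)
prune.remove(layer, "weight")

zero_rows = (layer.weight.abs().sum(dim=1) == 0).sum().item()
print(f"pruned neurons: {zero_rows} / {layer.out_features}")
```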

LLM Knowledge Distillation

Knowledge distillation involves training a smaller "student" model to emulate a more complex "teacher" model, effectively creating a compact yet proficient model. In the context of LLMs, this technique has two primary categories:

Standard Knowledge Distillation:

  • Transfers the broad knowledge of the teacher to the student.
  • Can use prompts and responses from models like ChatGPT to train smaller LLMs, though there are constraints related to data from commercial models.
  • MiniLLM, developed by Tsinghua University and Microsoft Research, improves the process by using specialized objective and optimization functions, addressing the challenge of accurately capturing data distributions.

Emergent Ability Distillation:

  • Targets the transfer of a specific capability from the teacher model.
  • Examples include extracting math or reasoning skills from GPT-4 to a smaller model, such as Vicuna.
  • Focusing on a narrower task set makes measuring emergent-ability (EA) distillation easier, but it's essential to recognize the limitations in transferring emergent behaviors to smaller LLMs.

LLM Quantization

Large language models (LLMs) store their parameters as floating-point values, so a model the size of GPT-3 occupies hundreds of gigabytes of memory. To shrink this footprint, a technique called quantization converts the parameters to lower-precision integers.
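
A quick back-of-the-envelope calculation (assuming a GPT-3-scale model with 175 billion parameters) shows why the precision of each parameter matters:

```python
params = 175e9  # GPT-3-sized model

for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]:
    gb = params * bits / 8 / 1e9
    print(f"{name}: ~{gb:,.0f} GB")

# fp32: ~700 GB, fp16: ~350 GB, int8: ~175 GB, int4: ~88 GB
```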

Benefits of Quantization:

  • Allows LLMs to run on everyday devices.
  • Examples of projects built around quantized LLMs include GPT4All and llama.cpp.


Quantization Approaches:

  • Quantization-Aware Training (QAT): Integrates quantization during training, allowing the model to learn low-precision representations. The downside is that it requires training from scratch.
  • Quantization-Aware Fine-Tuning (QAFT): Adapts a pre-trained model for lower-precision weights. Techniques like QLoRA and PEQA are used in this approach.
  • Post-Training Quantization (PTQ): Reduces precision after training without changing the architecture. It's simple and efficient but might affect accuracy.
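
To illustrate the QAT idea, here is a generic "fake quantization" sketch (not tied to any particular paper): weights are rounded to a low-precision grid in the forward pass, while the straight-through estimator lets gradients flow as if no rounding had happened.

```python
import torch

def fake_quantize(w: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Simulate symmetric integer quantization during training."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Straight-through estimator: forward uses w_q, backward sees the identity.
    return w + (w_q - w).detach()

w = torch.randn(256, 256, requires_grad=True)
loss = fake_quantize(w).pow(2).mean()
loss.backward()               # gradients reach w despite the rounding step
print(w.grad.abs().mean())
```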

For an in-depth exploration of LLM compression, the paper "A Survey on Model Compression for Large Language Models" is recommended.