AILAB Blog: Mixture-of-Experts (MoE)


10.20.2025

DeepSeek-OCR is Not About OCR

You read that right. The new paper and model from DeepSeek, titled "DeepSeek-OCR," is one of the most exciting developments in AI this year, but its true innovation has almost nothing to do with traditional Optical Character Recognition.

The project’s real goal is to solve one of the biggest problems in large language models: the context window.

This post is a technical deep dive into what DeepSeek-OCR really is: a method for compressing text through vision, one that could give LLMs a far larger effective memory.

The Core Problem: The Token Bottleneck

Large Language Models (LLMs) are limited by their context window, or how much information they can "remember" at one time. This limit exists because text is processed in "tokens," each of which is roughly a word or part of a word. A 1-million-token context window, while massive, still fills up, and processing 10 million tokens is computationally and financially staggering.

The challenge is: how can you feed a model a 10-page document, or your entire chat history, without running out of space?

The Solution: "Contexts Optical Compression"

DeepSeek's answer is brilliantly simple: stop thinking about text as text, and start thinking about it as an image.

The paper's full title, "DeepSeek-OCR: Contexts Optical Compression," says it all. The goal is not just to read text in an image (OCR), but to store text as an image.

This new method can take 1,000 text tokens, render them as an image, and compress that image into just 100 vision tokens. This "optical" representation can then be fed to a model, which decodes the original text with ~97% accuracy at a 10x compression ratio. At 20x compression (50 vision tokens for 1,000 text tokens), accuracy drops but still holds at about 60%.
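The ratios quoted above are plain arithmetic: text tokens divided by vision tokens. Here is a minimal sketch of that calculation. The function name and the small table are illustrative only and do not come from the paper or its code.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Return how many text tokens each vision token stands in for."""
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens


# Figures quoted in this post: (text tokens, vision tokens, reported accuracy).
QUOTED = [(1000, 100, "~97%"), (1000, 50, "~60%")]

for text, vision, accuracy in QUOTED:
    ratio = compression_ratio(text, vision)
    print(f"{text} text tokens -> {vision} vision tokens: "
          f"{ratio:.0f}x compression, accuracy {accuracy}")
```

The point of the sketch is the trade-off it makes visible: halving the vision-token budget doubles the ratio, and accuracy falls with it.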

Imagine an AI that, instead of storing your long conversation history as a text file, "remembers" it as a series of compressed images. This is a new form of AI memory.

Technical Deep Dive: The Architecture

So, how does it work? The system is composed of two primary components: a novel DeepEncoder for compression and an efficient MoE Decoder for reconstruction.

1. The DeepEncoder: The "Secret Sauce"

This isn't a standard vision encoder. It’s a highly specialized, 380-million-parameter system built in two stages to be both incredibly detailed and highly efficient.

Stage 1: Local Analysis (SAM). The encoder first uses SAM (Segment Anything Model), a powerful 80-million-parameter model from Meta. SAM's job is to analyze the image at high resolution and capture all the fine-grained, local details, essentially figuring out "what to pay attention to."
The Compressor (16x CNN). This is the key to its efficiency. The output from SAM, which would normally be a huge number of tokens, is immediately passed through a convolutional neural network (CNN) that downsamples it 16x. This network acts as a compressor, shrinking the token count by a factor of 16 before the next, more computationally expensive stage. For example, a 1024x1024 image (which starts as 4,096 patch tokens, at 16x16 pixels per patch) is compressed down to just 256 tokens.
Stage 2: Global Context (CLIP). These 256 compressed tokens are then fed into a CLIP ViT-300M, a 300-million-parameter model from OpenAI. CLIP’s job is to use global attention to understand how all these small pieces relate to each other, creating a rich, efficient summary of the entire image.
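The token counts in the stages above can be checked with a few lines of arithmetic. This is a sketch under one assumption that the post does not state outright: a 16x16-pixel patch size, which is what makes a 1024x1024 image come out to 4,096 patch tokens. The function name is hypothetical.

```python
from typing import Tuple


def deepencoder_token_flow(height: int, width: int,
                           patch_size: int = 16,
                           compression: int = 16) -> Tuple[int, int]:
    """Return (patch tokens entering SAM, tokens left after the 16x compressor).

    patch_size=16 is an assumption chosen to match the 4,096-token figure.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    patch_tokens = (height // patch_size) * (width // patch_size)
    compressed_tokens = patch_tokens // compression
    return patch_tokens, compressed_tokens


before, after = deepencoder_token_flow(1024, 1024)
print(f"{before} patch tokens -> {after} tokens passed to CLIP")
```

For a 1024x1024 input this gives 4,096 patch tokens and 256 compressed tokens, matching the numbers in the post. Global attention in the CLIP stage then runs over 256 tokens rather than 4,096.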

This multi-stage design is brilliant because it uses the lightweight SAM model for the high-resolution "grunt work" and the heavy-duty CLIP model only on the compressed data.

2. The Decoder: The "Reader"

Once the image is compressed into a small set of vision tokens, it needs to be read. This is handled by a DeepSeek-3B-MoE (Mixture-of-Experts) decoder.

The model has 3 billion total parameters, but because it uses an MoE architecture, it activates only a fraction of its "experts" for any given token. In this case, only ~570 million parameters are active during inference (6 of the 64 routed experts, plus 2 shared experts). This makes the decoder fast and efficient while maintaining high performance.
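To make "only a few experts fire per token" concrete, here is a generic top-k routing sketch in plain Python. It is not DeepSeek's gating code: the real router's scoring, normalization, and shared-expert handling may differ, and the random logits below stand in for a learned router's output.

```python
import math
import random
from typing import List, Tuple


def top_k_route(logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


random.seed(0)
# Stand-in router scores for one token over 64 experts (illustrative values).
router_logits = [random.gauss(0.0, 1.0) for _ in range(64)]
chosen = top_k_route(router_logits, k=6)
print(f"{len(chosen)} of {len(router_logits)} experts active for this token")
```

Only the chosen experts' feed-forward weights are evaluated for the token, which is why the active parameter count sits far below the total.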

Performance and "Gundam Mode"

This architecture is not just theoretical; it achieves state-of-the-art results. On benchmarks like OmniDocBench, DeepSeek-OCR outperforms other models while using a fraction of the tokens. For instance, it achieves better performance with fewer than 800 vision tokens than a competing model, MinerU 2.0, which needs over 6,000 tokens for the same page.

The model is also versatile, offering different modes to balance performance and token count:

Tiny Mode: 64 vision tokens
Small Mode: 100 vision tokens
Base Mode: 256 vision tokens
Large Mode: 400 vision tokens
Gundam Mode: A dynamic mode that can use up to ~1,800 tokens for extremely complex documents.
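One way to think about these modes is as a token budget you choose per page. The sketch below picks the cheapest fixed mode that keeps compression at or under a target ratio. The selection rule and the 10x default (taken from the ~97% accuracy figure earlier in the post) are assumptions for illustration, not how the released model chooses a mode.

```python
from typing import Optional

# Fixed-size modes and their vision-token budgets, as listed in this post.
MODES = {"Tiny": 64, "Small": 100, "Base": 256, "Large": 400}


def smallest_mode_for(text_tokens: int, max_ratio: float = 10.0) -> Optional[str]:
    """Return the cheapest fixed mode whose compression ratio is <= max_ratio.

    Returns None when even Large mode would compress too aggressively,
    which is a hint to fall back to the dynamic Gundam mode.
    """
    for name, vision_tokens in sorted(MODES.items(), key=lambda kv: kv[1]):
        if text_tokens / vision_tokens <= max_ratio:
            return name
    return None


for page_tokens in (600, 1000, 2500, 5000):
    mode = smallest_mode_for(page_tokens) or "Gundam (dynamic)"
    print(f"{page_tokens} text tokens -> {mode}")
```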

The Big Picture: The Future is "Optical Memory"

This paper is so much more than just an OCR paper. DeepSeek has proven that vision can be a highly efficient compression layer for language.

This opens the door to a new paradigm for AI systems. We can now build models with "optical memory," where long-term context is stored visually. This could even mimic human memory, where older memories are not lost, but become "blurrier" or more compressed over time.

DeepSeek-OCR isn't just a new tool; it's a fundamental shift in how we think about AI, memory, and the "thousand words" a single picture is truly worth.

6.26.2024

DeepSeek-Coder-V2: Open-Source Code Intelligence

Introduction

The field of code intelligence has seen remarkable advancements through the open-source community, with models like StarCoder, CodeLlama, and DeepSeek-Coder making significant strides. However, these models have yet to reach the performance levels of their closed-source counterparts such as GPT4-Turbo and Claude 3 Opus. Enter DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model designed to bridge this gap. Built on the foundation of DeepSeek-V2, DeepSeek-Coder-V2 undergoes further pre-training with an additional 6 trillion tokens, significantly enhancing its coding and mathematical reasoning capabilities while supporting 338 programming languages and extending context length to 128K tokens.

Enhanced Capabilities

DeepSeek-Coder-V2 stands out with its substantial improvements in various code-related tasks, achieving superior performance compared to closed-source models like GPT4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro. This model excels in benchmarks such as HumanEval, MBPP+, MATH, and GSM8K, demonstrating its prowess in both coding and math tasks. The extensive pre-training dataset, comprising 60% source code, 10% math corpus, and 30% natural language corpus, has been meticulously curated and expanded, resulting in significant accuracy improvements in benchmarks.

Training and Alignment

Training for the 16B-parameter DeepSeek-Coder-V2 model combines Next-Token-Prediction and Fill-In-Middle (FIM) objectives, while the 236B model uses Next-Token-Prediction alone. FIM splits each document into a prefix, a middle, and a suffix, then rearranges them in Prefix-Suffix-Middle (PSM) order so the model learns to reconstruct the missing middle from the context on both sides. The alignment phase then uses Group Relative Policy Optimization (GRPO) to align the model's behavior with human preferences, drawing on compiler feedback and test cases to optimize responses for correctness and user satisfaction.
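Both ideas are easier to see in miniature. The sketch below shows a FIM example laid out in prefix-suffix-middle order, and the group-relative advantage at the heart of GRPO, where each sampled response is scored against the mean and standard deviation of its own group. The sentinel strings are placeholders (the real tokenizer's special tokens may differ), the rewards are invented, and the full GRPO objective (clipping, KL penalty) is omitted.

```python
import statistics
from typing import List

# Placeholder sentinels; the actual special tokens are tokenizer-specific.
FIM_BEGIN, FIM_HOLE, FIM_END = "<fim_begin>", "<fim_hole>", "<fim_end>"


def make_psm_example(code: str, start: int, end: int) -> str:
    """Cut code[start:end] out as the 'middle' and emit prefix, suffix, middle."""
    prefix, middle, suffix = code[:start], code[start:end], code[end:]
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}"


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Normalize each reward against its own group's mean and std deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]  # identical rewards carry no signal
    return [(r - mean) / std for r in rewards]


snippet = "def add(a, b):\n    return a + b\n"
print(make_psm_example(snippet, start=15, end=31))

# Hypothetical rewards for four sampled solutions: 1.0 = tests pass, 0.0 = fail.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because advantages are computed within the group, GRPO needs no separate value network: a response is "good" only relative to its siblings for the same prompt.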

Contributions and Evaluations

DeepSeek-Coder-V2's contributions to the field of code intelligence are manifold. It introduces the first open-source hundred-billion-parameter code model, demonstrating significant advancements over state-of-the-art closed-source models. With a permissive license, DeepSeek-Coder-V2 is publicly available for both research and unrestricted commercial use, promoting further innovation and development in the field. Evaluation results highlight its superiority in code generation and mathematical reasoning, rivaling top closed-source models and setting new benchmarks in various evaluations.

Conclusion

The introduction of DeepSeek-Coder-V2 marks a significant milestone in the evolution of open-source code intelligence. With its enhanced capabilities, extensive training, and public availability, DeepSeek-Coder-V2 paves the way for further advancements in the field, providing a powerful tool for developers and researchers alike. As open-source models continue to close the gap with their closed-source counterparts, DeepSeek-Coder-V2 stands as a testament to the potential of collaborative innovation in the realm of code intelligence.

5.07.2024

Inside DeepSeek-V2's Advanced Language Model Architecture

Introduction to DeepSeek-V2

In the rapidly evolving world of artificial intelligence, the quest for more powerful and efficient language models is ceaseless. DeepSeek-V2 emerges as a pioneering solution, introducing a robust Mixture-of-Experts (MoE) architecture that marries economical training with high-efficiency inference. This model boasts a staggering 236 billion parameters, yet optimizes resource use by activating only 21 billion parameters per token. This design not only enhances performance but also significantly cuts down on both the training costs and the memory footprint during operation.
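As a quick sanity check on what "activating only 21 billion parameters per token" means in proportion, here is the arithmetic as a tiny sketch. The helper name is illustrative.

```python
def active_fraction(active_params: float, total_params: float) -> float:
    """Share of a sparse model's parameters that participate per token."""
    if total_params <= 0:
        raise ValueError("total_params must be positive")
    return active_params / total_params


# Figures quoted in this post for DeepSeek-V2.
share = active_fraction(21e9, 236e9)
print(f"{share:.1%} of parameters active per token")
```

That works out to roughly 9% of the model doing work on any one token, which is where the savings in compute come from.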

Revolutionary Architectural Enhancements

DeepSeek-V2 builds on two pivotal technologies: Multi-head Latent Attention (MLA) and the DeepSeekMoE framework. MLA compresses the key-value cache into a compact latent representation, reducing its size by over 93% relative to DeepSeek 67B, which greatly speeds up inference without sacrificing accuracy. DeepSeekMoE, for its part, makes it economical to train powerful models by using sparse computation, so that only a targeted subset of parameters is used for each token.
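To see why the key-value cache matters, it helps to size one. The sketch below computes the cache for a standard multi-head attention model and then applies the 93% reduction quoted above. The layer count, head count, and head dimension are hypothetical values chosen for illustration; they are not DeepSeek-V2's or DeepSeek 67B's actual configuration.

```python
def kv_cache_bytes(layers: int, heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Standard attention cache: one key and one value vector
    per head, per layer, per token (2 bytes each for fp16/bf16)."""
    return 2 * layers * heads * head_dim * seq_len * bytes_per_value


# Hypothetical dimensions for illustration only.
baseline = kv_cache_bytes(layers=60, heads=64, head_dim=128, seq_len=128_000)
reduced = baseline * (1 - 0.93)  # apply the ~93% reduction quoted in the post

print(f"baseline cache: {baseline / 2**30:.1f} GiB per sequence")
print(f"after 93% cut:  {reduced / 2**30:.1f} GiB per sequence")
```

The cache grows linearly with sequence length, so at long contexts it can dominate memory. Shrinking it frees room for larger batches, which is where the inference speedup comes from.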

Training Economies and Efficiency

One of the standout features of DeepSeek-V2 is that it cuts training costs by an impressive 42.5% compared with its dense predecessor, DeepSeek 67B. Much of the saving comes from sparse activation: because only a fraction of the parameters is used for each token, far fewer computations are needed per training step. DeepSeek-V2 also supports an extended context length of up to 128K tokens, a significant leap over traditional models, making it adept at complex tasks that require deeper contextual understanding.

Pre-training and Fine-Tuning

DeepSeek-V2 was pretrained on a diverse, high-quality multi-source corpus with substantially more data than its predecessor's, particularly in Chinese. The corpus totals 8.1 trillion tokens, providing a rich dataset that contributes significantly to the model’s robustness and versatility. Following pretraining, the model underwent Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to improve its conversational ability and its alignment with human preferences.

Comparative Performance and Future Applications

In benchmarks, DeepSeek-V2 stands out for its superior performance across multiple languages and tasks, outperforming its predecessors and other contemporary models. It offers compelling improvements in training and inference efficiency that make it a valuable asset for a range of applications, from automated customer service to sophisticated data analysis tasks. Looking ahead, the potential applications of DeepSeek-V2 in areas like real-time multilingual translation and automated content generation are incredibly promising.

Conclusion and Forward Look

DeepSeek-V2 represents a significant advancement in the field of language models. Its innovative architecture and cost-effective training approach set new standards for what is possible in AI technologies. As we look to the future, the ongoing development of models like DeepSeek-V2 will continue to push the boundaries of machine learning, making AI more accessible and effective across various industries.

Model

DeepSeek-V2-Chat