AILAB Blog: Mixture-of-Experts

Imagine an AI that can read a million-word document in one go, analyze a series of images alongside your text prompts, and still outsmart some of the biggest names in the game—all while being freely available for anyone to download. Sounds like science fiction? Well, Meta has just turned this into reality with the launch of the Llama 4 suite of models, unveiled on April 5, 2025. This isn’t just an upgrade; it’s a revolution in artificial intelligence, blending speed, efficiency, and multimodal magic into a trio of models that are already making waves: Llama 4 Scout, Llama 4 Maverick, and the colossal Llama 4 Behemoth.

Meet the Llama 4 Herd

Meta’s latest lineup is a masterclass in diversity and power. Here’s the breakdown:

Llama 4 Scout: Think of it as the nimble trailblazer. With 17 billion active parameters and 109 billion total parameters across 16 experts, it’s built for speed and optimized for inference. Its standout feature? An industry-leading 10 million token context length—perfect for tackling massive datasets like entire codebases or sprawling novels without breaking a sweat.
Llama 4 Maverick: The multitasking marvel. Also boasting 17 billion active parameters but with a whopping 128 experts and 400 billion total parameters, this model is natively multimodal, seamlessly blending text and images. It handles a 1 million token context length and delivers top-tier performance at a fraction of the cost of its rivals.
Llama 4 Behemoth: The heavyweight champion still in training. With 288 billion active parameters and 2 trillion total parameters across 16 experts, it’s the brain behind the operation, serving as a teacher model to refine its smaller siblings. Meta reports that it already outperforms giants like GPT-4.5 and Claude Sonnet 3.7 on several STEM benchmarks.

What’s even better? Scout and Maverick are open-weight and available for download right now on llama.com and Hugging Face, while Behemoth promises to be a game-changer once it’s fully trained.
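
To make the "active versus total" numbers in the lineup concrete, here is a minimal sketch that computes the share of each model's parameters used per token. The parameter counts are the ones quoted above; the dictionary layout and the `active_fraction` helper are illustrative names, not part of any Meta tooling.

```python
# Parameter counts in billions, as quoted in the model list above.
MODELS = {
    "Llama 4 Scout": {"active": 17, "total": 109, "experts": 16},
    "Llama 4 Maverick": {"active": 17, "total": 400, "experts": 128},
    "Llama 4 Behemoth": {"active": 288, "total": 2000, "experts": 16},
}


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of a model's parameters that a single token actually uses."""
    return active_b / total_b


for name, spec in MODELS.items():
    frac = active_fraction(spec["active"], spec["total"])
    print(f"{name}: {frac:.1%} of parameters active per token")
```

Scout and Behemoth come out at roughly one parameter in seven, while Maverick, with its 128 experts, uses only about one in twenty-four for any given token.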

Why Llama 4 Stands Out

So, what makes these models the talk of the AI world? Let’s dive into the key features that set Llama 4 apart:

Mixture-of-Experts (MoE) Architecture
Forget the old-school approach where every parameter works on every task. Llama 4 uses a mixture-of-experts (MoE) design, activating only a fraction of its parameters for each input. For example, Maverick’s 400 billion parameters slim down to 17 billion in action, slashing costs and boosting speed. It’s like having a team of specialists instead of a jack-of-all-trades—efficiency without compromise.
Native Multimodality
These models don’t just read text—they see images and videos too. Thanks to early fusion, Llama 4 integrates text and vision tokens from the ground up, trained on a massive dataset of over 30 trillion tokens, including text, images, and video stills. Need an AI to analyze a photo and write a description? Maverick’s got you covered.
Mind-Blowing Context Lengths
Context is king, and Llama 4 wears the crown. Scout handles up to 10 million tokens, while Maverick manages 1 million. That’s enough to process entire books, lengthy legal documents, or complex code repositories in one go. The secret? Innovations like the iRoPE architecture, which interleaves attention layers that use no positional embeddings with layers that use rotary position embeddings (RoPE), with “infinite” context length as the long-term goal.
Unmatched Performance
Numbers don’t lie. Maverick beats out GPT-4o and Gemini 2.0 on benchmarks like coding, reasoning, and image understanding, all while costing less to run. Scout outperforms peers like Llama 3.3 70B and Mistral 3.1 24B in its class. And Behemoth? It’s already topping STEM charts, leaving Claude Sonnet 3.7 and GPT-4.5 in the dust.
Distillation from a Titan
The smaller models owe their smarts to Behemoth, which uses a cutting-edge co-distillation process to pass down its wisdom. This teacher-student dynamic ensures Scout and Maverick punch above their weight, delivering high-quality results without the computational heft.
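
The post does not spell out the loss function used in this teacher-student setup, so the following is only a sketch of the textbook distillation objective: a weighted mix of a "hard" term (cross-entropy against the true next token) and a "soft" term (cross-entropy against the teacher's temperature-smoothed distribution). The fixed `soft_weight`, the `temperature` value, and all function names are assumptions made for illustration, not Meta's co-distillation recipe.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, target_index,
                      soft_weight=0.5, temperature=2.0):
    """Blend of hard-target and soft-target cross-entropy for one token."""
    # Hard term: how well the student predicts the ground-truth token.
    student_probs = softmax(student_logits)
    hard = -math.log(student_probs[target_index])

    # Soft term: how closely the student matches the teacher's
    # temperature-smoothed distribution over the whole vocabulary.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft = -sum(t * math.log(s) for t, s in zip(teacher_soft, student_soft))

    return soft_weight * soft + (1.0 - soft_weight) * hard


teacher = [2.0, 0.0, -1.0]
print(distillation_loss([2.0, 0.0, -1.0], teacher, target_index=0))  # student agrees
print(distillation_loss([-1.0, 0.0, 2.0], teacher, target_index=0))  # student disagrees
```

A student whose logits agree with the teacher gets a lower loss than one that disagrees, which is the signal that lets a small model inherit behaviour from a much larger one.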

Built with Care: Safety and Fairness

Meta isn’t just chasing performance; it is also committed to responsibility. Llama 4 comes with safety measures at every stage, from pre-training data filters to post-training tools like Llama Guard (for detecting harmful content) and Prompt Guard (to spot malicious inputs). Meta has also tackled bias head-on, reducing refusal rates on debated political and social topics from 7% in Llama 3.3 to below 2% in Llama 4, and halving the rate of responses with a strong political lean compared with its predecessor. The result? An AI that’s more balanced and responsive to all viewpoints.

How They Made It Happen

Behind the scenes, Llama 4’s creation is a feat of engineering:

Pre-training: A dataset of more than 30 trillion tokens, more than double that of Llama 3, mixing text, images, and videos, trained with FP8 precision on 32K GPUs for efficiency.
Post-training: A revamped pipeline with lightweight supervised fine-tuning (SFT), online reinforcement learning (RL), and direct preference optimization (DPO) to boost reasoning, coding, and math skills.
Innovations: Techniques like MetaP for hyperparameter tuning and mid-training to extend context lengths ensure these models are both powerful and practical.
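
Of the post-training stages listed above, DPO is the easiest to show in a few lines. The sketch below implements the published DPO objective for a single preference pair: the policy is rewarded for raising the log-probability of the preferred answer, relative to a frozen reference model, more than it raises that of the rejected answer. The `beta` value and the argument names are illustrative assumptions; nothing here reflects Meta's actual training configuration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of summed log-probabilities."""
    # How much more (or less) likely each answer is under the policy
    # than under the frozen reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(x)) written in a numerically stable form.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Policy prefers the chosen answer more than the reference does: low loss.
print(dpo_loss(-1.0, -5.0, -3.0, -3.0))
# Policy prefers the rejected answer: higher loss.
print(dpo_loss(-5.0, -1.0, -3.0, -3.0))
```

When the policy and the reference agree exactly, the loss sits at log 2; it falls as the policy separates preferred from rejected answers.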

The Bottom Line

Llama 4 isn’t just another AI model—it’s a bold step into the future. Its blend of multimodal intelligence, unprecedented efficiency, and open accessibility makes it a playground for developers, a tool for businesses, and a marvel for anyone curious about AI’s potential. Whether you’re coding the next big app, analyzing vast datasets, or exploring creative AI frontiers, Llama 4 has something extraordinary to offer.

Mistral AI, on its steadfast mission to empower the developer community with cutting-edge open models, proudly presents Mixtral 8x7B, a high-quality sparse mixture-of-experts (SMoE) model with open weights. Released under the Apache 2.0 license, Mixtral outperforms Llama 2 70B on most benchmarks with 6x faster inference and offers the best cost/performance trade-off among open-weight models. It proves to be a formidable competitor, even outperforming GPT-3.5 on various standard benchmarks.

Mixtral Highlights:

Handles a context of 32k tokens with grace.
Multilingual capabilities: English, French, Italian, German, and Spanish.
Demonstrates robust performance in code generation.
Achieved an impressive score of 8.3 on MT-Bench as an instruction-following model.

Pushing the Frontier of Open Models with Sparse Architectures

Mixtral is a decoder-only model built on a sparse mixture-of-experts network. Each feedforward block holds 8 distinct groups of parameters (the experts). At every layer, for every token, a router network selects two of them and combines their outputs additively. This raises the model's total parameter count while keeping cost and latency in check: of its 46.7B total parameters, Mixtral uses only 12.9B per token, so it runs at roughly the speed and cost of a 12.9B model.
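
The routing idea can be sketched in plain Python. This toy version assumes two experts are selected per token, as in Mistral's description of Mixtral, and that the gate weights come from a softmax over the selected experts' scores. In the real model the router scores come from a learned linear layer and the experts are feedforward networks; here the scores are passed in directly and the experts are hypothetical stand-ins that just scale their input.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k=2):
    """Return (expert_index, gate_weight) pairs for the k best-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    gates = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, gates))


def moe_forward(token, router_logits, experts, k=2):
    """Run only the selected experts and sum their gate-weighted outputs."""
    output = [0.0] * len(token)
    for index, gate in top_k_route(router_logits, k):
        expert_out = experts[index](token)
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output


# Toy experts (illustrative only): expert i multiplies its input by i + 1.
experts = [lambda x, scale=i + 1: [scale * v for v in x] for i in range(8)]
router_logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]

print(top_k_route(router_logits))
print(moe_forward([1.0, -1.0], router_logits, experts))
```

Six of the eight experts are never called for this token, which is where the savings come from: every expert's weights must be stored, but only the selected ones cost compute.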

Performance Comparison

Mixtral matches or outperforms Llama 2 70B and GPT-3.5 on most benchmarks, offering a better trade-off between quality and inference budget. Mistral's detailed benchmarks also show Mixtral to be more truthful and less biased than Llama 2, making it a strong contender among open-weight models.

Instructed Models

Mistral introduces Mixtral 8x7B Instruct, optimized for careful instruction following. Scoring 8.30 on MT-Bench, it stands as the best open-weight model at release, rivaling the performance of GPT-3.5. Mixtral can be prompted to ban specific outputs, which supports applications that demand strong moderation.

Open-Source Deployment Stack

To facilitate community usage, Mistral AI has contributed changes to the vLLM project that integrate MegaBlocks CUDA kernels for efficient inference. SkyPilot allows vLLM endpoints to be deployed on any cloud instance, making Mixtral easy to self-host.

Experience Mixtral on Our Platform

Mistral AI currently deploys Mixtral 8x7B behind the mistral-small endpoint, which is available in beta. Register now for early access to all generative and embedding endpoints.

Acknowledgments

Mistral AI extends its gratitude to the CoreWeave and Scaleway teams for their invaluable technical support during model training.

AILAB Blog

4.05.2025

Meta’s Llama 4: A New Era of Multimodal AI Innovation

12.17.2023

Introducing Mixtral 8x7B: Mistral AI's Breakthrough Sparse Mixture-of-Experts Model