AILAB Blog: DeepSeek Open-Source Week

⫸ FlashMLA

Honored to share FlashMLA - our efficient MLA decoding kernel for Hopper GPUs, optimized for variable-length sequences and now in production.

✅ BF16 support

✅ Paged KV cache (block size 64)

⚡ 3000 GB/s memory-bound & 580 TFLOPS compute-bound on H800

🔗 GitHub: https://github.com/deepseek-ai/FlashMLA

⫸ DeepEP

Excited to introduce DeepEP - the first open-source EP communication library for MoE model training and inference.

✅ Efficient and optimized all-to-all communication

✅ Both intranode and internode support with NVLink and RDMA

✅ High-throughput kernels for training and inference prefilling

✅ Low-latency kernels for inference decoding

✅ Native FP8 dispatch support

✅ Flexible GPU resource control for computation-communication overlapping

🔗 GitHub: https://github.com/deepseek-ai/DeepEP

⫸ DeepGEMM

Introducing DeepGEMM - an FP8 GEMM library that supports both dense and MoE GEMMs, powering V3/R1 training and inference.

⚡ Up to 1350+ FP8 TFLOPS on Hopper GPUs

✅ No heavy dependency, as clean as a tutorial

✅ Fully Just-In-Time compiled

✅ Core logic at ~300 lines - yet outperforms expert-tuned kernels across most matrix sizes

✅ Supports dense layout and two MoE layouts

🔗 GitHub: https://github.com/deepseek-ai/DeepGEMM

⫸ Optimized Parallelism Strategies

✅ DualPipe - a bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.

🔗 GitHub: https://github.com/deepseek-ai/DualPipe

✅ EPLB - an expert-parallel load balancer for V3/R1.

🔗 GitHub: https://github.com/deepseek-ai/eplb

✅ Analyze computation-communication overlap in V3/R1.

🔗 GitHub: https://github.com/deepseek-ai/profile-data

⫸ 3FS, Thruster for All DeepSeek Data Access

Fire-Flyer File System (3FS) - a parallel file system that utilizes the full bandwidth of modern SSDs and RDMA networks.

⚡ 6.6 TiB/s aggregate read throughput in a 180-node cluster

⚡ 3.66 TiB/min throughput on GraySort benchmark in a 25-node cluster

⚡ 40+ GiB/s peak throughput per client node for KVCache lookup

🧬 Disaggregated architecture with strong consistency semantics

✅ Training data preprocessing, dataset loading, checkpoint saving/reloading, embedding vector search & KVCache lookups for inference in V3/R1

📥 3FS → https://github.com/deepseek-ai/3FS

⛲ Smallpond → https://github.com/deepseek-ai/smallpond

⫸ DeepSeek-V3/R1 Inference System Overview

Optimized throughput and latency via:

🔧 Cross-node EP-powered batch scaling

🔄 Computation-communication overlap

⚖️ Load balancing

Statistics of DeepSeek's Online Service:

⚡ 73.7k/14.8k input/output tokens per second per H800 node

🚀 Cost profit margin 545%

💡 We hope this week's insights offer value to the community and contribute to our shared AGI goals.

📖 Deep Dive: https://bit.ly/4ihZUiO

AILAB Blog

3.05.2025

DeepSeek Open-Source Week

No comments:

Post a Comment