3.05.2025

DeepSeek Open-Source Week

FlashMLA

Honored to share FlashMLA - our efficient Multi-head Latent Attention (MLA) decoding kernel for Hopper GPUs, optimized for variable-length sequences and now in production.


✅ BF16 support

✅ Paged KV cache (block size 64)

⚡ Up to 3000 GB/s in memory-bound and 580 TFLOPS in compute-bound configurations on H800

🔗 GitHub: https://github.com/deepseek-ai/FlashMLA
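
For context, the decoding path follows a metadata-then-kernel pattern: plan the split-KV schedule once per batch, then call the attention kernel per layer. The sketch below mirrors the usage shown in the repo's README; the exact signatures, head dimensions, and tensor shapes here are assumptions to verify against the source.

    import torch
    from flash_mla import get_mla_metadata, flash_mla_with_kvcache

    # Illustrative shapes (assumptions): 4 decode requests, 128 query heads,
    # 1 latent KV head, head dim 576 with a 512-dim value slice, 64-token pages.
    b, s_q, h_q, h_kv, d, dv, page = 4, 1, 128, 1, 576, 512, 64
    cache_seqlens = torch.full((b,), 1024, dtype=torch.int32, device="cuda")
    pages_per_req = 1024 // page
    block_table = torch.arange(b * pages_per_req, dtype=torch.int32,
                               device="cuda").view(b, pages_per_req)
    q = torch.randn(b, s_q, h_q, d, dtype=torch.bfloat16, device="cuda")
    kv_cache = torch.randn(b * pages_per_req, page, h_kv, d,
                           dtype=torch.bfloat16, device="cuda")

    # Plan the tile schedule once, then run the paged-KV decode kernel.
    tile_scheduler_metadata, num_splits = get_mla_metadata(
        cache_seqlens, s_q * h_q // h_kv, h_kv)
    out, lse = flash_mla_with_kvcache(
        q, kv_cache, block_table, cache_seqlens, dv,
        tile_scheduler_metadata, num_splits, causal=True)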



DeepEP


Excited to introduce DeepEP - the first open-source expert-parallel (EP) communication library for MoE model training and inference.


✅ Efficient and optimized all-to-all communication

✅ Intranode (NVLink) and internode (RDMA) support

✅ High-throughput kernels for training and inference prefilling

✅ Low-latency kernels for inference decoding

✅ Native FP8 dispatch support

✅ Flexible GPU resource control for computation-communication overlapping

🔗 GitHub: https://github.com/deepseek-ai/DeepEP
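
To make "all-to-all dispatch" concrete, here is a minimal conceptual sketch using plain torch.distributed rather than DeepEP's kernels: it shows what dispatch means in expert parallelism (sending each token to the rank that hosts its routed expert), not how DeepEP implements it.

    import torch
    import torch.distributed as dist

    def dispatch_tokens(tokens, dest_rank, world_size):
        # Group tokens bound for the same rank contiguously.
        order = torch.argsort(dest_rank)
        sorted_tokens = tokens[order]
        send_counts = torch.bincount(dest_rank, minlength=world_size)

        # Exchange counts so each rank can size its receive buffer.
        recv_counts = torch.empty_like(send_counts)
        dist.all_to_all_single(recv_counts, send_counts)

        # Variable-length all-to-all: the core of MoE dispatch.
        recv = tokens.new_empty(int(recv_counts.sum()), tokens.shape[1])
        dist.all_to_all_single(recv, sorted_tokens,
                               output_split_sizes=recv_counts.tolist(),
                               input_split_sizes=send_counts.tolist())
        return recv, order  # keep `order` to un-permute in the combine step

DeepEP's contribution is performing this exchange with NVLink/RDMA-optimized kernels, FP8 payloads, and hooks for overlapping communication with computation.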



DeepGEMM


Introducing DeepGEMM - an FP8 GEMM library that supports both dense and MoE GEMMs, powering V3/R1 training and inference.


⚡ Up to 1350+ FP8 TFLOPS on Hopper GPUs

✅ No heavy dependencies - as clean as a tutorial

✅ Fully Just-In-Time compiled

✅ Core logic in ~300 lines - yet outperforms expert-tuned kernels across most matrix sizes

✅ Supports dense layout and two MoE layouts

🔗 GitHub: https://github.com/deepseek-ai/DeepGEMM
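
DeepGEMM's FP8 GEMMs pair e4m3 data with fine-grained scaling factors to keep low-precision accumulation accurate. The sketch below shows a plausible per-group quantization step in plain PyTorch; it illustrates the scaling scheme, not DeepGEMM's actual code.

    import torch

    def quantize_fp8_groupwise(x: torch.Tensor, group: int = 128):
        """Quantize a 2-D tensor to FP8 e4m3 with one scale per 1 x `group` slice."""
        m, k = x.shape
        assert k % group == 0
        g = x.view(m, k // group, group)
        # Map each group's absolute max onto e4m3's largest finite value (448).
        scale = g.abs().amax(dim=-1, keepdim=True).clamp(min=1e-4) / 448.0
        q = (g / scale).to(torch.float8_e4m3fn)
        return q.view(m, k), scale.squeeze(-1)  # fp8 data plus (m, k // group) scales

Data/scale pairs along these lines are what fine-grained FP8 GEMM kernels consume at matmul time; see the repo for the actual JIT-compiled entry points.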



Optimized Parallelism Strategies


✅ DualPipe - a bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.

🔗 GitHub: https://github.com/deepseek-ai/DualPipe


✅ EPLB - an expert-parallel load balancer for V3/R1 (see the load-balancing sketch after this list).

🔗 GitHub: https://github.com/deepseek-ai/eplb


✅ Profiling data for analyzing computation-communication overlap in V3/R1.

🔗 GitHub: https://github.com/deepseek-ai/profile-data
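
As a toy complement to EPLB, the sketch below shows the goal of expert-parallel load balancing - equalizing per-GPU expert load - via simple greedy placement. This is an illustration only; EPLB's actual algorithm also handles expert replication and hierarchical, group-aware placement.

    import heapq

    def balance_experts(expert_load: dict, num_gpus: int) -> dict:
        """Greedy sketch: place heaviest experts first on the least-loaded GPU."""
        heap = [(0.0, gpu) for gpu in range(num_gpus)]
        heapq.heapify(heap)
        placement = {}
        for expert, load in sorted(expert_load.items(), key=lambda kv: -kv[1]):
            total, gpu = heapq.heappop(heap)
            placement[expert] = gpu
            heapq.heappush(heap, (total + load, gpu))
        return placement

    # e.g. balance_experts({"e0": 9.0, "e1": 5.0, "e2": 4.0, "e3": 2.0}, num_gpus=2)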



3FS, Thruster for All DeepSeek Data Access


Fire-Flyer File System (3FS) - a parallel file system that utilizes the full bandwidth of modern SSDs and RDMA networks.


⚡ 6.6 TiB/s aggregate read throughput in a 180-node cluster

⚡ 3.66 TiB/min throughput on GraySort benchmark in a 25-node cluster

⚡ 40+ GiB/s peak throughput per client node for KVCache lookup

🧬 Disaggregated architecture with strong consistency semantics

✅ Used across V3/R1: training data preprocessing, dataset loading, checkpoint saving/reloading, embedding vector search, and KVCache lookups for inference


📥 3FS → https://github.com/deepseek-ai/3FS

⛲ Smallpond → https://github.com/deepseek-ai/smallpond
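
Smallpond is the lightweight data-processing layer that sits on top of 3FS. The snippet below paraphrases the usage pattern from its README; treat the exact method names and signatures as assumptions to verify against the repo.

    import smallpond

    sp = smallpond.init()

    # Read a Parquet dataset (backed by 3FS in production; any path works locally),
    # partition it, run per-partition SQL, and materialize the result.
    df = sp.read_parquet("prices.parquet")
    df = df.repartition(3, hash_by="ticker")
    df = sp.partial_sql(
        "SELECT ticker, min(price), max(price) FROM {0} GROUP BY ticker", df)
    print(df.to_pandas())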



DeepSeek-V3/R1 Inference System Overview


Optimized throughput and latency via:

🔧 Cross-node EP-powered batch scaling

🔄 Computation-communication overlap

⚖️ Load balancing
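
As a toy illustration of computation-communication overlap (the idea behind the scheduling above, not DeepSeek's implementation): launch a collective asynchronously and keep the GPU computing until its result is actually needed.

    import torch.distributed as dist

    def overlapped_step(layer_a, layer_b, x, shared):
        # Kick off the collective asynchronously; NCCL runs it on its own stream.
        handle = dist.all_reduce(shared, async_op=True)
        y = layer_a(x)   # compute proceeds while the all-reduce is in flight
        handle.wait()    # synchronize only when the communicated result is needed
        return layer_b(y + shared)  # shapes assumed compatible (illustration)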


Statistics of DeepSeek's Online Service:

⚡ 73.7k/14.8k input/output tokens per second per H800 node

🚀 Cost profit margin of 545%


💡 We hope this week's insights offer value to the community and contribute to our shared AGI goals.

📖 Deep Dive: https://bit.ly/4ihZUiO
