Llama.cpp has taken a significant leap forward with the recent integration of RPC code, enabling distributed inference across multiple machines. This development marks a departure from the old MPI framework, paving the way for more flexible and efficient AI model deployment. In this blog post, we will explore the implications of this update, discuss its limitations, and provide a detailed guide on setting up distributed inference with Llama.cpp.
Overview of the Update
A few days ago, the RPC backend by Radoslav Gerganov was merged into Llama.cpp, and the old MPI code was removed. This means Llama.cpp now supports distributed inference, allowing a single model to run across multiple machines. The feature is still a work in progress and has some notable limitations: it is currently restricted to FP16 with no quantization support, and it does not work with the Vulkan backend. Even with these constraints, it performs admirably. Inference speed is largely determined by the network, with a 1 gigabit Ethernet connection offering far better performance than Wi-Fi, and the overall speed is capped by the slowest machine in the setup.
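As a rough sketch of how a worker machine is set up (exact option and binary names have varied between llama.cpp versions, so check the current build docs; the CMake flag has been spelled both LLAMA_RPC and GGML_RPC, and the address and port below are just examples): build with the RPC backend enabled and start an rpc-server that the client can reach.

    # On each worker machine: build llama.cpp with the RPC backend enabled
    cmake -B build -DGGML_RPC=ON
    cmake --build build --config Release

    # Start the RPC worker; -H 0.0.0.0 makes it reachable from other machines
    # (by default it binds to localhost only), -p picks the port
    ./build/bin/rpc-server -H 0.0.0.0 -p 50052

The client machine is built the same way; an example of the command it runs is shown after the benchmark numbers below.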
Performance Metrics
To illustrate the performance of distributed inference, let's compare an M1 Max Mac Studio and a PC with a 7900xtx, using the TinyLlama FP16 model. Here are the results with the Mac as the client:
Mac only:
• Prompt eval time: 199.23 ms / 508 tokens (0.39 ms per token, 2549.77 tokens per second)
• Eval time: 8423.24 ms / 511 runs (16.48 ms per token, 60.67 tokens per second)
7900xtx only:
• Prompt eval time: 100.50 ms / 508 tokens (0.20 ms per token, 5054.98 tokens per second)
• Eval time: 10574.48 ms / 511 runs (20.69 ms per token, 48.32 tokens per second)
Mac + 7900xtx:
• Prompt eval time: 230.29 ms / 508 tokens (0.45 ms per token, 2205.92 tokens per second)
• Eval time: 11147.19 ms / 511 runs (21.81 ms per token, 45.84 tokens per second)
When the 7900xtx PC is used as the client instead, the numbers shift: whichever machine is reached over the network pays the penalty, again highlighting the impact of network speed:
Mac only:
• Prompt eval time: 253.78 ms / 508 tokens (0.50 ms per token, 2001.77 tokens per second)
• Eval time: 10627.55 ms / 511 runs (20.80 ms per token, 48.08 tokens per second)
7900xtx only:
• Prompt eval time: 40.93 ms / 508 tokens (0.08 ms per token, 12412.34 tokens per second)
• Eval time: 4249.10 ms / 511 runs (8.32 ms per token, 120.26 tokens per second)
Mac + 7900xtx:
• Prompt eval time: 198.44 ms / 508 tokens (0.39 ms per token, 2559.98 tokens per second)
• Eval time: 11117.95 ms / 511 runs (21.76 ms per token, 45.96 tokens per second)
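For context, the figures above are the prompt eval / eval timing lines that llama.cpp prints after every run. A hypothetical client-side invocation producing this kind of output might look like the following; the model file name, worker address, and port are placeholders, and on older builds the binary is called main rather than llama-cli:

    # On the client: split the model between the local backend and the
    # remote rpc-server(s) listed in --rpc (comma-separated host:port pairs)
    ./build/bin/llama-cli -m tinyllama-1.1b-f16.gguf -f prompt.txt -n 512 \
        --rpc 192.168.1.20:50052 -ngl 99

Dropping the --rpc argument gives the single-machine baselines, and running the same command from the other machine swaps which side is local and which is remote.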
The Bottleneck: Network Speed
The inference speed is notably limited by the network connection. For example, using Wi-Fi instead of Ethernet significantly reduces performance:
Mac over Wi-Fi:
• Prompt eval time: 737.93 ms / 508 tokens (1.45 ms per token, 688.41 tokens per second)
• Eval time: 42125.17 ms / 511 runs (82.44 ms per token, 12.13 tokens per second)
These results clearly show that network speed is a critical factor in distributed inference: the same hardware reaches roughly 46 to 48 tokens per second (t/s) over gigabit Ethernet but only about 12 t/s over Wi-Fi.
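A rough back-of-envelope estimate (my own approximation, not something measured in these runs) suggests why latency matters as much as raw bandwidth here. TinyLlama's hidden size is 2048, so the FP16 activations that cross the split for a single generated token amount to only about 2048 × 2 bytes ≈ 4 KB, which any link handles easily. But each generated token also needs at least one network round trip between client and worker, and Wi-Fi round trips are both longer and far more variable than wired ones, which is consistent with the per-token eval time jumping from roughly 21 ms over Ethernet to over 80 ms over Wi-Fi.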
Conclusion
The integration of RPC code into Llama.cpp opens up new possibilities for distributed inference across multiple machines. Despite its current limitations, this feature shows promising results, significantly improving the flexibility and scalability of AI model deployment. By understanding the impact of network speed and following the setup guidelines, you can harness the power of distributed inference to enhance your AI projects.