Perplexity developed software optimizations enabling trillion-parameter mixture of experts (MoE) models to run efficiently on older, cheaper GPU hardware using AWS Elastic Fabric Adapter (EFA). The new kernels address memory and network latency challenges by optimizing communication between GPUs across multiple nodes, achieving lower latency than existing solutions like DeepSeek's DeepEP. Testing on AWS H200 instances with models like DeepSeek V3 and Kimi K2 showed substantial performance gains at medium batch sizes, making distributed inference on MoE models viable without expensive GB200/GB300 systems. The optimizations are open-sourced on GitHub.

6m read timeFrom go.theregister.com
Post cover image
Table of contents
Mo parameters, mo problemsCutting the EFA overhead

Sort: