Perplexity has developed software optimizations that enable trillion-parameter mixture-of-experts (MoE) models to run efficiently on older, cheaper GPU hardware using AWS Elastic Fabric Adapter (EFA). The new kernels address memory and network latency challenges by optimizing communication between GPUs across multiple nodes.
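To see why inter-node communication dominates in this setting, consider how MoE routing works: each token is dispatched to its top-k experts, and those experts are sharded across GPU nodes, so most dispatches must cross the network. The sketch below is purely illustrative (it is not Perplexity's kernel code); the expert count, node count, and top-k values are hypothetical, and it simply counts local versus remote dispatches under even expert sharding.

```python
import random

NUM_EXPERTS = 64      # hypothetical expert count
NUM_NODES = 8         # experts sharded evenly across nodes
TOP_K = 2             # experts activated per token

def node_of(expert: int) -> int:
    """Node hosting a given expert under even sharding."""
    return expert // (NUM_EXPERTS // NUM_NODES)

def route(num_tokens: int, home_node: int, seed: int = 0):
    """Route tokens to random top-k experts; count local vs remote dispatches."""
    rng = random.Random(seed)
    local = remote = 0
    for _ in range(num_tokens):
        for e in rng.sample(range(NUM_EXPERTS), TOP_K):
            if node_of(e) == home_node:
                local += 1
            else:
                remote += 1
    return local, remote

local, remote = route(10_000, home_node=0)
# With experts spread over 8 nodes, roughly 7/8 of dispatches leave the node,
# so the all-to-all exchange over the network (EFA, in this case) is the
# bandwidth- and latency-critical step the optimized kernels target.
print(f"local={local} remote={remote}")
```

Under uniform routing, the expected remote fraction is (NUM_NODES - 1) / NUM_NODES, which is why shaving per-message overhead on the interconnect has an outsized effect on end-to-end MoE inference latency.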

From go.theregister.com