Researchers developed MPK (Mirage Persistent Kernel), a compiler that transforms LLM inference into a single GPU megakernel, achieving 1.2-6.7x latency reduction. The system eliminates kernel launch overhead by fusing all computation and communication operations into one persistent kernel, enables fine-grained software pipelining across layers, and overlaps computation with inter-GPU communication. MPK converts LLM computation graphs into optimized task graphs with sub-kernel level dependencies, then executes them using distributed on-GPU schedulers and workers within a single kernel context.

9m read timeFrom zhihaojia.medium.com
Post cover image
Table of contents
Why MPK?What’s Next?Part 1. The Compiler: Transforming an LLM into a Fine-Grained Task GraphPart 2. The Runtime: Executing a Task Graph in a MegaKernelLooking ForwardWant to Learn More?

Sort: