TL;DR: We developed a compiler that automatically transforms LLM inference into a single megakernel — a fused GPU kernel that performs all necessary computation and communication in one launch. This…

Researchers developed MPK (Mirage Persistent Kernel), a compiler that transforms LLM inference into a single GPU megakernel, achieving 1.2-6.7x latency reduction. The system eliminates kernel launch overhead by fusing all computation and communication operations into one persistent kernel, enables fine-grained software pipelining across layers, and overlaps computation with inter-GPU communication. MPK converts LLM computation graphs into optimized task graphs with sub-kernel level dependencies, then executes them using distributed on-GPU schedulers and workers within a single kernel context.
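To make the task-graph idea concrete, here is a minimal CPU-side sketch in Python. It is not MPK's code and does not run on a GPU: the `Task` class, the `run_task_graph` function, and the tile names are all hypothetical, and a single loop stands in for what the summary describes as distributed on-GPU schedulers and workers. It only illustrates how dependencies at sub-kernel (tile) level let a communication task start as soon as the one tile it needs is ready, instead of waiting for the whole preceding operator.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    """One unit of work (e.g. one tile of a matmul). Hypothetical structure."""
    name: str
    deps: list = field(default_factory=list)  # names of tasks that must finish first


def run_task_graph(tasks, num_workers=2):
    """Simulate dependency-driven execution of a task graph.

    Returns a list of steps; each step is the list of task names that the
    simulated workers run together. A task becomes ready only when its
    counter of unfinished dependencies reaches zero.
    """
    remaining = {t.name: len(t.deps) for t in tasks}
    dependents = {t.name: [] for t in tasks}
    for t in tasks:
        for dep in t.deps:
            dependents[dep].append(t.name)

    # Tasks without dependencies are ready immediately.
    ready = deque(name for name, count in remaining.items() if count == 0)
    schedule = []
    finished = 0

    while ready:
        # Hand up to num_workers ready tasks to the simulated workers.
        batch = [ready.popleft() for _ in range(min(num_workers, len(ready)))]
        schedule.append(batch)
        finished += len(batch)

        # Completing a task decrements the counters of its dependents.
        for name in batch:
            for child in dependents[name]:
                remaining[child] -= 1
                if remaining[child] == 0:
                    ready.append(child)

    if finished != len(tasks):
        raise ValueError("task graph contains a cycle")
    return schedule


def build_tiled_example(num_tiles=3):
    """Each all-reduce tile depends only on the matmul tile that produces it."""
    tasks = [Task(f"matmul_{i}") for i in range(num_tiles)]
    tasks += [Task(f"allreduce_{i}", deps=[f"matmul_{i}"]) for i in range(num_tiles)]
    return tasks


if __name__ == "__main__":
    for step, batch in enumerate(run_task_graph(build_tiled_example())):
        print(step, batch)
```

With three tiles and two simulated workers, the second step runs the last matmul tile alongside the first all-reduce tile, which is the kind of compute/communication overlap the summary attributes to MPK. With a single coarse dependency between the two operators, no all-reduce work could start until every matmul tile had finished.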

Compiling LLMs into a MegaKernel: A Path to Low-Latency Inference

Part 1. The Compiler: Transforming an LLM into a Fine-Grained Task Graph

Part 2. The Runtime: Executing a Task Graph in a MegaKernel