The post details building a fast LLM inference engine from scratch in C++ and CUDA, without relying on existing libraries. It focuses on single-GPU inference, optimizing token throughput, and surpassing llama.cpp's performance. Key topics include LLM architectures, the mechanics of inference, and memory-bandwidth bottlenecks.

40 min read · From andrewkchan.dev
Table of contents

Fast LLM Inference From Scratch
1. Recap: LLM architectures and inference
2. Inference on the CPU
3. Inference on the GPU
4. What's next
