A CppCon talk that walks through progressive optimizations for dense matrix multiplication in C++, targeting near-peak CPU performance. Starting from a naive triple-loop implementation (152 s), the speaker applies loop reordering, cache tiling, manual SIMD vectorization with AVX intrinsics, cache-aware blocking tuned to L1/L2 sizes, explicit register allocation, multithreading, repacking of the matrices into a tiled layout, the C++26 SIMD library, loop unrolling via template metaprogramming, and software prefetching. Each step is benchmarked with Google Benchmark and compared against OpenBLAS. On an AMD Zen 5 CPU (9950X), the final implementation beats OpenBLAS for doubles and achieves ~3 teraflops for bfloat16. The talk also covers assembly analysis with perf and VTune, compiler differences between GCC and Clang (up to 30% variance), and future directions including mdspan, std::linalg, and executor-based threading strategies.
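To make the first step in that list concrete, here is a minimal sketch of loop reordering. It is not the speaker's code: the function names `matmul_naive` and `matmul_ikj` are hypothetical, and the sketch assumes square, row-major matrices stored in flat `std::vector<double>` buffers. The naive i-j-k order reads `b` with a stride of `n` in the inner loop, while the i-k-j order reads `b` and writes `c` contiguously, which tends to be friendlier to the cache and to auto-vectorization.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Assumption: a, b, c are n x n matrices in row-major order.

// Naive i-j-k order: the inner loop walks down a column of b (stride n).
void matmul_naive(const std::vector<double>& a, const std::vector<double>& b,
                  std::vector<double>& c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}

// Reordered i-k-j: the inner loop walks a row of b and a row of c
// contiguously, accumulating directly into c.
void matmul_ikj(const std::vector<double>& a, const std::vector<double>& b,
                std::vector<double>& c, std::size_t n) {
    std::fill(c.begin(), c.end(), 0.0);  // c is used as the accumulator
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const double aik = a[i * n + k];  // hoisted: constant over j
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
}
```

Both functions compute the same product; only the memory access pattern differs. Any speedup depends on matrix size, compiler, and hardware, so it should be measured rather than assumed.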