https://cppcon.org
---

Matrix Multiplication Deep Dive || Cache Blocking, SIMD & Parallelization - Aliaksei Sala - CppCon 2025
---

Matrix multiplication is a fundamental operation in scientific computing, game development, AI, and numerous high-performance applications. While its mathematical definition is simple, achieving optimal performance in C++ is far from trivial. 
 
In this talk, we will explore different optimization techniques for matrix multiplication, from naive implementations to highly tuned versions leveraging modern hardware features. We will cover key performance-enhancing strategies such as loop unrolling, cache blocking, SIMD vectorization, parallelization using threads and more. Through benchmarking and profiling, we will measure the real impact of these optimizations. 
 
By the end of this session, attendees will gain insights into two critical questions: 
 
How hard is it to implement an optimized matrix multiplication in C++?
How effective is C++ for achieving peak performance in this task?
 
This talk is suitable for developers interested in performance optimization, computational efficiency, and modern C++ techniques for numerical computing. 

---

Slides: https://github.com/CppCon/CppCon2025/blob/main/Presentations/Achieving_Peak_Performance_for_Matrix_Multiplication.key

---

Aliaksei Sala
 
I’m a Lead Software Engineer at EPAM Systems with over 10 years of experience in C++ and high-performance computing. My background spans embedded systems, Linux, and AI acceleration, and I’m currently working with Tenstorrent’s RISC-V–based compute platform. I enjoy digging into performance-critical code, from optimizing matrix multiplication to exploring modern C++ techniques that push hardware efficiency. I’m also active in the C++ community and excited to share my work on performance engineering at CppCon.
---


CppCon is the annual, week-long face-to-face gathering for the entire C++ community. The conference is organized by the C++ community for the community. You will enjoy inspirational talks and a friendly atmosphere designed to help attendees learn from each other, meet interesting people, and generally have a stimulating experience. Taking place this year in Aurora, Colorado, near the Denver airport, and including multiple diverse tracks, the conference will appeal to anyone from C++ novices to experts.
Annual CppCon Conference - https://www.cppcon.org
---

Videos Filmed & Edited by Bash Films: http://www.BashFilms.com
YouTube Channel Managed by Digital Medium Ltd: https://events.digital-medium.co.uk
---

This CppCon talk walks through progressive optimizations for dense matrix multiplication in C++, targeting near-peak CPU performance. Starting from a naive triple-loop implementation (152 s), the speaker applies, in turn: loop reordering, cache tiling, manual SIMD vectorization with AVX intrinsics, cache-aware blocking tuned to L1/L2 sizes, explicit register allocation, multithreading, repacking matrices into a tiled layout, the C++26 SIMD library, loop unrolling via template metaprogramming, and software prefetching.

Each step is benchmarked with Google Benchmark and compared against OpenBLAS. On an AMD Zen 5 CPU (9950X), the final implementation beats OpenBLAS for doubles and reaches about 3 TFLOPS for bfloat16.

The talk also covers assembly analysis with perf and VTune, compiler differences between GCC and Clang (performance gaps of up to 30%), and future directions including mdspan, std::linalg, and executor-based threading strategies.
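
To make the first few steps concrete, here is a minimal sketch of the naive triple loop, the reordered loop, and a cache-tiled variant. This is an illustration written for this description, not the speaker's code: the function names, the flat row-major `std::vector<double>` storage, and the default tile size of 64 are all assumptions, and the tile size in particular is a placeholder rather than a value tuned to any cache.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical storage choice: square n x n matrix, row-major, in a flat vector.
// Every variant accumulates into c, so the caller must zero-initialize c.
using Matrix = std::vector<double>;

// Naive i-j-k order: the inner loop walks a column of b with stride n,
// which touches a new cache line on almost every iteration for large n.
void matmul_naive(const Matrix& a, const Matrix& b, Matrix& c, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

// Loop reordering (i-k-j): the inner loop now reads a row of b and writes a
// row of c contiguously, which is also easier for compilers to auto-vectorize.
void matmul_reordered(const Matrix& a, const Matrix& b, Matrix& c, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k) {
            const double aik = a[i * n + k];
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += aik * b[k * n + j];
        }
}

// Cache tiling: process tile x tile sub-blocks so the working set of a, b,
// and c stays cache-resident while it is reused. std::min handles the case
// where n is not a multiple of the tile size.
void matmul_tiled(const Matrix& a, const Matrix& b, Matrix& c, std::size_t n,
                  std::size_t tile = 64) {
    assert(tile > 0);
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    for (std::size_t ii = 0; ii < n; ii += tile)
        for (std::size_t kk = 0; kk < n; kk += tile)
            for (std::size_t jj = 0; jj < n; jj += tile) {
                const std::size_t i_end = std::min(ii + tile, n);
                const std::size_t k_end = std::min(kk + tile, n);
                const std::size_t j_end = std::min(jj + tile, n);
                for (std::size_t i = ii; i < i_end; ++i)
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const double aik = a[i * n + k];
                        for (std::size_t j = jj; j < j_end; ++j)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
}
```

All three functions compute the same product; only the memory access order differs. The later steps mentioned above (explicit SIMD, register blocking, threading, repacking, prefetching) build on the tiled structure and are not shown here.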