Comprehensive guide to programming Matrix Cores on AMD CDNA3 and CDNA4 architectures using HIP kernels. Covers low-precision floating-point types (FP16, FP8, FP6, FP4), compiler intrinsics for matrix fused-multiply-add operations, and data layouts required by Matrix Core instructions. Includes detailed code examples demonstrating how to leverage Matrix Cores for up to 64x performance gains over FP32 operations, with focus on mixed-precision matrix multiplication and the new block exponent scaling instructions in CDNA4.

25m read time
From salykova.github.io
Table of contents
1. Matrix Cores
2. Low-Precision Floating-Point Types
3. Matrix fused-multiply-add (MFMA) Instructions
4. Compiler Intrinsics
5. Examples
Summary
