A hands-on exploration of using DirectX 12 Cooperative Vectors (preview) to accelerate MLP inference on NVIDIA Tensor Cores from HLSL shaders. It covers setup requirements (Agility SDK 1.717.1-preview, DXC with SM6.9, NVIDIA 590.26 driver), enabling experimental features, defining long vectors and MatrixRef/VectorRef types, converting weights from float32 to float16, buffer alignment requirements (128-byte for weights, 64-byte for biases), and a complete 2-hidden-layer MLP implementation. Benchmarks on an RTX 3080 mobile show modest 2x speedups for small networks (3-3-3-3) but dramatic gains for larger ones: a 6-32-32-32-1 MLP achieves a 41.7x speedup and a 6-64-64-64-1 MLP a 173x speedup over an unoptimized compute shader baseline. In its current form, the feature will be superseded by the Linear Algebra Matrix spec targeting SM6.10.
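The float32-to-float16 conversion and the alignment rules mentioned above can be illustrated with a small offline packing sketch. This is not the article's code: the helper names (`align_up`, `to_float16_bytes`, `pack_mlp`) are hypothetical, and it assumes weights and biases for every layer live in one buffer, with each matrix stored as a flat list of values. Any layout conversion the driver may require is not modelled here.

```python
import struct

WEIGHT_ALIGN = 128  # bytes; weight alignment as stated in the summary
BIAS_ALIGN = 64     # bytes; bias alignment as stated in the summary


def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def to_float16_bytes(values):
    """Pack floats as little-endian IEEE 754 half-precision values.

    struct raises OverflowError if a value does not fit in float16.
    """
    return struct.pack("<%de" % len(values), *values)


def pack_mlp(layers):
    """Pack (weights, biases) pairs into one byte blob (hypothetical layout).

    Returns the blob and a list of (weight_offset, bias_offset) per layer.
    """
    blob = bytearray()
    offsets = []
    for weights, biases in layers:
        # Pad with zeros so the weight matrix starts on a 128-byte boundary.
        w_off = align_up(len(blob), WEIGHT_ALIGN)
        blob.extend(b"\0" * (w_off - len(blob)))
        blob.extend(to_float16_bytes(weights))
        # Pad again so the bias vector starts on a 64-byte boundary.
        b_off = align_up(len(blob), BIAS_ALIGN)
        blob.extend(b"\0" * (b_off - len(blob)))
        blob.extend(to_float16_bytes(biases))
        offsets.append((w_off, b_off))
    return bytes(blob), offsets


# A 3-3-3-3 network has three layers, each with 9 weights and 3 biases.
layers = [([0.5] * 9, [1.0] * 3) for _ in range(3)]
blob, offsets = pack_mlp(layers)
print(offsets)
```

The resulting offsets are what a shader would be handed when it addresses each layer's matrix and bias vector. Small networks such as 3-3-3-3 spend most of the buffer on padding, which is harmless for a sketch but worth knowing when sizing the upload.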