A hands-on exploration of using DirectX 12 Cooperative Vectors (preview) to accelerate MLP inference via Nvidia Tensor cores in HLSL shaders. Covers setup requirements (Agility SDK 1.717.1-preview, DXC with SM6.9, Nvidia 590.26 driver), enabling experimental features, defining long vectors and MatrixRef/VectorRef types, weight conversion from float32 to float16, buffer alignment requirements (128-byte for weights, 64-byte for biases), and a complete 2-hidden-layer MLP implementation. Performance benchmarks on an RTX 3080 mobile show modest 2x speedups for small networks (3-3-3-3) but dramatic gains for larger ones: a 6-32-32-32-1 MLP achieves 41.7x speedup and a 6-64-64-64-1 MLP achieves 173x speedup over an unoptimized compute shader baseline. The feature in its current form will be superseded by the Linear Algebra Matrix spec targeting SM6.10.
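The weight conversion and alignment steps mentioned above can be sketched on the host side in C++. This is a minimal illustration, not the post's code: `align_up`, `float_to_half`, `plan_layout`, and the struct names are hypothetical helpers. It assumes all weights and biases are packed tightly as float16 into a single buffer in layer order. A real Cooperative Vectors setup may use a driver-defined matrix layout whose size has to be queried from the API, and this sketch does not model that.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Round `offset` up to the next multiple of `alignment` (must be a power of two).
inline std::size_t align_up(std::size_t offset, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Convert an IEEE-754 float32 to float16 bits with round-to-nearest-even.
// Assumes the default floating-point rounding mode and no fast-math flags.
inline std::uint16_t float_to_half(float f) {
    std::uint32_t x;
    std::memcpy(&x, &f, sizeof x);
    const std::uint32_t sign = (x >> 16) & 0x8000u;
    x &= 0x7FFFFFFFu;  // work on the magnitude only

    if (x >= 0x7F800000u) {  // infinity or NaN
        const std::uint32_t nan_bit = (x > 0x7F800000u) ? 0x0200u : 0u;
        return static_cast<std::uint16_t>(sign | 0x7C00u | nan_bit);
    }
    if (x >= 0x477FF000u) {  // magnitude >= 65520 rounds to infinity
        return static_cast<std::uint16_t>(sign | 0x7C00u);
    }
    if (x < 0x38800000u) {  // below 2^-14: result is a half subnormal or zero
        float mag;
        std::memcpy(&mag, &x, sizeof mag);
        mag += 0.5f;  // the float adder rounds to half-subnormal precision
        std::uint32_t bits;
        std::memcpy(&bits, &mag, sizeof bits);
        return static_cast<std::uint16_t>(sign | (bits - 0x3F000000u));
    }
    // Normal range: rebias the exponent (127 -> 15) and round the 13 dropped bits.
    const std::uint32_t mant_odd = (x >> 13) & 1u;
    x += 0xC8000000u + 0x0FFFu;
    x += mant_odd;
    return static_cast<std::uint16_t>(sign | (x >> 13));
}

struct LayerOffsets {
    std::size_t weights;  // byte offset of the layer's weight matrix
    std::size_t biases;   // byte offset of the layer's bias vector
};

struct MlpLayout {
    std::vector<LayerOffsets> layers;
    std::size_t total_bytes = 0;
};

// Plan byte offsets for an MLP given its layer widths, e.g. {6, 32, 32, 32, 1}.
// Alignment values follow the summary: 128 bytes for weights, 64 for biases.
inline MlpLayout plan_layout(const std::vector<std::size_t>& widths) {
    constexpr std::size_t kWeightAlign = 128;
    constexpr std::size_t kBiasAlign = 64;
    constexpr std::size_t kElemSize = sizeof(std::uint16_t);  // float16

    MlpLayout layout;
    std::size_t cursor = 0;
    for (std::size_t i = 0; i + 1 < widths.size(); ++i) {
        cursor = align_up(cursor, kWeightAlign);
        const std::size_t weight_offset = cursor;
        cursor += widths[i] * widths[i + 1] * kElemSize;

        cursor = align_up(cursor, kBiasAlign);
        const std::size_t bias_offset = cursor;
        cursor += widths[i + 1] * kElemSize;

        layout.layers.push_back({weight_offset, bias_offset});
    }
    layout.total_bytes = cursor;
    return layout;
}
```

Calling `plan_layout({6, 32, 32, 32, 1})` gives one offset pair per layer. Each float32 weight would then be passed through `float_to_half` and written at its layer's offset before the buffer is uploaded.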

14m read time. From interplayoflight.wordpress.com