The MXFP8 and NVFP4 microscaling quantization formats, natively supported by NVIDIA's Blackwell architecture (B200 GPUs), deliver significant inference speedups for diffusion models. Using diffusers and TorchAO, benchmarks on Flux.1-Dev, QwenImage, and LTX-2 show up to a 1.26x speedup with MXFP8 and up to 1.68x with NVFP4 compared to the unquantized baseline.
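To make the "microscaling" idea concrete, here is a minimal pure-Python sketch of MXFP8-style block quantization: groups of 32 elements share one power-of-two (E8M0) scale, and each element is rounded to an E4M3-like value with 3 mantissa bits. This is a simplified illustration of the format, not the TorchAO implementation; it ignores subnormals and other FP8 edge cases.

```python
import math

E4M3_MAX = 448.0   # largest normal value representable in FP8 E4M3
BLOCK = 32         # MX block size: 32 elements share one scale

def mx_scale(block):
    # Shared E8M0 scale: a power of two chosen so the block's amax
    # fits inside the E4M3 range after division.
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0
    return 2.0 ** math.ceil(math.log2(amax / E4M3_MAX))

def quantize_block(block):
    # Simplified element quantization: divide by the shared scale and
    # round to 3 mantissa bits (E4M3-style), ignoring subnormals.
    s = mx_scale(block)
    q = []
    for v in block:
        x = v / s
        if x == 0.0:
            q.append(0.0)
            continue
        e = math.floor(math.log2(abs(x)))
        step = 2.0 ** (e - 3)   # 3 mantissa bits -> 8 steps per binade
        q.append(round(x / step) * step)
    return s, q

def dequantize_block(scale, q):
    return [scale * x for x in q]

block = [0.013 * (i - 16) for i in range(BLOCK)]
scale, q = quantize_block(block)
deq = dequantize_block(scale, q)
```

With 3 mantissa bits, the relative round-off error per element stays below 2^-4 (6.25%); NVFP4 follows the same shared-scale pattern with smaller (16-element) blocks and 4-bit elements, trading more error for more speed.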

From pytorch.org (12 min read)
Table of contents
- Background on MXFP8 and NVFP4
- Basic Usage with diffusers and TorchAO
- Basic Usage
- Benchmark Results
- Technical Considerations
- Optimizing Accuracy and Performance with Selective Quantization
- Conclusion
- Resources
