MXFP8 and NVFP4, the microscaling quantization formats natively supported by NVIDIA's Blackwell architecture (B200 GPUs), deliver significant inference speedups for diffusion models. Using diffusers and TorchAO, benchmarks on Flux.1-Dev, QwenImage, and LTX-2 show speedups of up to 1.26x with MXFP8 and up to 1.68x with NVFP4 over BF16 baselines, along with peak memory reductions of up to roughly 3.5x. The post covers selective quantization strategies (skipping small or accuracy-critical layers), CUDA Graphs via torch.compile's reduce-overhead mode to cut CPU overhead at small batch sizes, and LPIPS as a perceptual accuracy metric. Code and reproducible recipes are provided for all three models.
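As a rough sanity check on the memory figure, the back-of-the-envelope sketch below computes the average storage cost per weight element for each format. It assumes the standard block layouts (MXFP8: 32 elements sharing one 8-bit E8M0 scale; NVFP4: 16 elements sharing one 8-bit E4M3 scale) and counts weight storage only. The helper names are hypothetical, and this is not how the post's peak-memory numbers were measured; actual peak memory also includes activations and any layers left unquantized.

```python
# Illustrative arithmetic only: average bits stored per weight element.
# Assumed layouts: MXFP8 = 8-bit elements, one 8-bit scale per 32 elements;
# NVFP4 = 4-bit elements, one 8-bit scale per 16 elements.
# (NVFP4's additional per-tensor FP32 scale is negligible and ignored here.)

BF16_BITS = 16.0


def bits_per_element(element_bits: int, block_size: int, scale_bits: int) -> float:
    """Average bits per element, including the shared per-block scale."""
    return element_bits + scale_bits / block_size


def compression_vs_bf16(bits: float) -> float:
    """Weight-storage reduction factor relative to BF16."""
    return BF16_BITS / bits


mxfp8_bits = bits_per_element(element_bits=8, block_size=32, scale_bits=8)
nvfp4_bits = bits_per_element(element_bits=4, block_size=16, scale_bits=8)

print(f"MXFP8: {mxfp8_bits} bits/element, {compression_vs_bf16(mxfp8_bits):.2f}x smaller than BF16")
print(f"NVFP4: {nvfp4_bits} bits/element, {compression_vs_bf16(nvfp4_bits):.2f}x smaller than BF16")
```

Under these assumptions NVFP4 weights take about 4.5 bits per element, which puts the theoretical weight-storage reduction in the same range as the reported peak-memory savings.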
Table of contents
Background on MXFP8 and NVFP4
Basic Usage with diffusers and TorchAO
Basic Usage
Benchmark Results
Technical Considerations
Optimizing Accuracy and Performance with Selective Quantization
Conclusion
Resources