NVIDIA achieved a 10.2x performance improvement for inference of the FLUX.2 text-to-image model on Blackwell B200 GPUs compared to H200. The optimization combines NVFP4 4-bit quantization, TeaCache (a diffusion step-skipping technique), CUDA Graphs, torch.compile, and multi-GPU parallelism. NVFP4 uses two-level microblock scaling to preserve accuracy: an FP8 (E4M3) scale factor for each 16-element micro-block, combined with a per-tensor FP32 scale.
Table of contents
- Visual comparison between BF16 and NVFP4 with FLUX.2 [dev]
- Optimizing FLUX.2 [dev]
- Performance analysis
- Get started with FLUX.2 on NVIDIA Blackwell GPUs
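The two-level scaling idea behind NVFP4 can be illustrated with a simplified NumPy sketch. This is not NVIDIA's implementation (real NVFP4 runs in Blackwell tensor cores and stores block scales in FP8 E4M3); it only simulates the numerics: each 16-element block gets its own scale, a single per-tensor FP32 scale keeps those block scales inside FP8 range, and each element is rounded to the nearest FP4 (E2M1) value.

```python
import numpy as np

# Magnitudes representable in FP4 E2M1, the NVFP4 element format
FP4_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_fake_quant(x, block_size=16):
    """Simulate NVFP4 two-level scaling on a 1-D float32 tensor whose
    length is a multiple of block_size; returns the dequantized tensor."""
    # Level 1: one FP32 per-tensor scale, chosen so the largest per-block
    # scale fits the FP8 E4M3 range (max ~448) used on hardware
    tensor_scale = np.abs(x).max() / (6.0 * 448.0)
    tensor_scale = tensor_scale if tensor_scale > 0 else 1.0
    blocks = (x / tensor_scale).reshape(-1, block_size)

    # Level 2: one scale per 16-element micro-block, mapping the block's
    # max magnitude onto FP4's largest value (6.0)
    block_scale = np.abs(blocks).max(axis=1, keepdims=True) / 6.0
    block_scale = np.where(block_scale == 0, 1.0, block_scale)
    scaled = blocks / block_scale

    # Round each element to the nearest FP4 magnitude, keeping the sign
    idx = np.abs(np.abs(scaled)[..., None] - FP4_VALUES).argmin(axis=-1)
    q = np.sign(scaled) * FP4_VALUES[idx]

    # Dequantize: undo both scaling levels
    return (q * block_scale * tensor_scale).reshape(x.shape)

x = np.random.randn(64).astype(np.float32)
xq = nvfp4_fake_quant(x)
```

Because each block is scaled independently, a block of small activations is not crushed by a single large outlier elsewhere in the tensor, which is the main reason micro-block formats hold accuracy better than per-tensor 4-bit quantization.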