NVIDIA achieved a 10.2x performance improvement for inference of the FLUX.2 text-to-image model on Blackwell B200 GPUs compared to H200. The optimization combines NVFP4 4-bit quantization, TeaCache (a diffusion step-skipping technique), CUDA Graphs, torch.compile, and multi-GPU parallelism. NVFP4 uses two-level microblock scaling to preserve accuracy: an FP8 (E4M3) scale factor for each 16-element micro-block, combined with a per-tensor FP32 scale.
Table of contents
- Visual comparison between BF16 and NVFP4 with FLUX.2 [dev]
- Optimizing FLUX.2 [dev]
- Performance analysis
- Get started with FLUX.2 on NVIDIA Blackwell GPUs
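The two-level scaling idea behind NVFP4 can be illustrated with a simplified NumPy sketch. This is not NVIDIA's implementation (real NVFP4 runs in Blackwell tensor cores and stores block scales in FP8 E4M3); it only simulates the numerics: each 16-element block gets its own scale, a single per-tensor FP32 scale keeps those block scales inside FP8 range, and each element is rounded to the nearest FP4 (E2M1) value.

```python
import numpy as np

# Magnitudes representable in FP4 E2M1, the NVFP4 element format
FP4_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_fake_quant(x, block_size=16):
    """Simulate NVFP4 two-level scaling on a 1-D float32 tensor whose
    length is a multiple of block_size; returns the dequantized tensor."""
    # Level 1: one FP32 per-tensor scale, chosen so the largest per-block
    # scale fits the FP8 E4M3 range (max ~448) used on hardware
    tensor_scale = np.abs(x).max() / (6.0 * 448.0)
    tensor_scale = tensor_scale if tensor_scale > 0 else 1.0
    blocks = (x / tensor_scale).reshape(-1, block_size)

    # Level 2: one scale per 16-element micro-block, mapping the block's
    # max magnitude onto FP4's largest value (6.0)
    block_scale = np.abs(blocks).max(axis=1, keepdims=True) / 6.0
    block_scale = np.where(block_scale == 0, 1.0, block_scale)
    scaled = blocks / block_scale

    # Round each element to the nearest FP4 magnitude, keeping the sign
    idx = np.abs(np.abs(scaled)[..., None] - FP4_VALUES).argmin(axis=-1)
    q = np.sign(scaled) * FP4_VALUES[idx]

    # Dequantize: undo both scaling levels
    return (q * block_scale * tensor_scale).reshape(x.shape)

x = np.random.randn(64).astype(np.float32)
xq = nvfp4_fake_quant(x)
```

Because each block is scaled independently, a block of small activations is not crushed by a single large outlier elsewhere in the tensor, which is the main reason micro-block formats hold accuracy better than per-tensor 4-bit quantization.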