To get the most out of AI, optimizations are critical. When developers think about optimizing AI models for inference, model compression techniques—such as…

NVIDIA DevTalk is a community forum where developers discuss, get help with, and collaborate on projects involving NVIDIA hardware and software. Members share insights, troubleshoot issues, and exchange best practices for GPU programming and AI development. Developers can also showcase their projects, get feedback, and network with peers in the NVIDIA ecosystem.

NVIDIA Developer

NVIDIA introduces NVFP4, a 4-bit floating-point format for Blackwell GPUs that targets ultra-low-precision inference while preserving model accuracy. NVFP4 applies micro-block scaling with scale factors encoded in FP8 (E4M3), and reduces memory footprint by about 3.5x compared to FP16 and about 1.8x compared to FP8. NVIDIA reports up to 50x energy-efficiency gains over H100 and minimal accuracy degradation (1% or less) on language modeling tasks. NVFP4 is supported by TensorRT Model Optimizer, vLLM, and SGLang, and pre-quantized models are available on Hugging Face.
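To make the micro-block idea and the memory figures in the summary above more concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not NVIDIA's implementation: it assumes 4-bit E2M1 values (1 sign, 2 exponent, 1 mantissa bit), blocks of 16 values, and one 8-bit scale per block. The scale is kept as an ordinary Python float instead of being rounded to E4M3, any per-tensor scale is ignored, and the names `quantize_block`, `dequantize_block`, and `bits_per_value` are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

# Assumptions for this sketch (not stated in the summary above):
BLOCK_SIZE = 16  # values sharing one scale factor
VALUE_BITS = 4   # bits per stored value
SCALE_BITS = 8   # bits per block scale (FP8 E4M3 in the real format)


def quantize_block(block):
    """Scale a block so its largest magnitude maps to 6.0, then snap each
    value to the nearest E2M1 grid point. Returns (scale, quantized values)."""
    amax = max(abs(x) for x in block)
    # The scale stays a Python float here; a real encoder would round it to E4M3.
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    quantized = []
    for x in block:
        target = abs(x) / scale
        magnitude = min(E2M1_GRID, key=lambda g: abs(target - g))
        quantized.append(magnitude if x >= 0 else -magnitude)
    return scale, quantized


def dequantize_block(scale, quantized):
    """Reconstruct approximate original values from the scale and 4-bit values."""
    return [scale * q for q in quantized]


def bits_per_value(value_bits=VALUE_BITS, scale_bits=SCALE_BITS, block_size=BLOCK_SIZE):
    """Average storage cost per value, amortizing the block scale over the block."""
    return value_bits + scale_bits / block_size


if __name__ == "__main__":
    scale, q = quantize_block([12.0, -6.0, 1.0, 0.0])
    print(scale, q, dequantize_block(scale, q))
    bpv = bits_per_value()
    print(bpv, 16 / bpv, 8 / bpv)
```

Under these assumptions each value costs 4.5 bits on average, so the ratio is 16 / 4.5 ≈ 3.56 against FP16 and 8 / 4.5 ≈ 1.78 against FP8, in line with the roughly 3.5x and 1.8x reductions quoted above.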

Introducing NVFP4 for Efficient and Accurate Low-Precision Inference

High-precision scaling: Encoding more signal, less error

Micro-block scaling for efficient model compression

NVFP4 versus FP8: Model performance and memory efficiency