NVIDIA introduces NVFP4, a 4-bit floating-point format for Blackwell GPUs that targets ultra-low-precision inference while maintaining model accuracy. NVFP4 applies micro-block scaling, with each small block of 4-bit values sharing a scale factor stored in FP8 (E4M3), and reduces memory footprint by about 3.5x compared to FP16 and 1.8x compared to FP8. NVIDIA reports up to 50x energy-efficiency gains over H100 and minimal accuracy degradation (1% or less) on language modeling tasks. NVFP4 is supported by TensorRT Model Optimizer, vLLM, and SGLang, and pre-quantized models are available on Hugging Face.
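
The 3.5x and 1.8x figures can be sanity-checked with a little arithmetic. The sketch below is a rough estimate, not NVIDIA's accounting: it assumes a micro-block of 16 values sharing one 8-bit E4M3 scale (the block size is not stated in this summary) and ignores any per-tensor scale as negligible. The helper name `bits_per_value` is hypothetical.

```python
def bits_per_value(element_bits, scale_bits=0, block_size=1):
    """Average storage cost per value, including the shared block scale."""
    return element_bits + scale_bits / block_size

# Assumption: 16 four-bit values share one 8-bit (E4M3) scale factor.
nvfp4_bits = bits_per_value(element_bits=4, scale_bits=8, block_size=16)
fp16_bits = bits_per_value(element_bits=16)
fp8_bits = bits_per_value(element_bits=8)

ratio_vs_fp16 = fp16_bits / nvfp4_bits
ratio_vs_fp8 = fp8_bits / nvfp4_bits

print(f"NVFP4 bits per value: {nvfp4_bits}")
print(f"Reduction vs FP16: {ratio_vs_fp16:.2f}x")
print(f"Reduction vs FP8:  {ratio_vs_fp8:.2f}x")
```

Under these assumptions the scale adds half a bit per value, which is why the reduction lands slightly below the ideal 4x and 2x.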

10m read time
From developer.nvidia.com
Table of contents
What is NVFP4?
High-precision scaling: Encoding more signal, less error
Micro-block scaling for efficient model compression
NVFP4 versus FP8: Model performance and memory efficiency
FP4 energy efficiency
Get started with NVFP4
