NVFP4 is NVIDIA's 4-bit floating-point quantization format, designed for deploying large language models with near-baseline accuracy. Red Hat has released NVFP4-quantized versions of well-known models ranging from 8 billion to over 400 billion parameters. Reported accuracy recovery is up to 99% for large models (70B-235B parameters), 97-99% for mid-sized models (~30B parameters), and 95-98% for smaller models (7B-14B parameters).
NVFP4 also reduces storage and speeds up computation. Quantized models are roughly 1.5 to 1.8 times smaller than their FP8 counterparts and about three times smaller than FP16, and the format is natively hardware-accelerated on NVIDIA Blackwell GPUs.
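The storage ratios can be sanity-checked with simple arithmetic. The sketch below is a rough estimate, not a description of any released checkpoint. It assumes NVFP4's published layout of 4-bit values with one 8-bit scale shared by each block of 16 values, and it ignores the single per-tensor scale. The function names and the 95% quantized share are hypothetical, chosen only for illustration.

```python
# Back-of-the-envelope storage arithmetic for a block-scaled 4-bit format.
# Assumptions (not stated in the text above): 4-bit values, one 8-bit scale
# per 16-value block, and a hypothetical share of weights left in FP16.


def effective_bits_per_weight(value_bits=4, scale_bits=8, block_size=16):
    """Average bits stored per weight, including the shared block scale."""
    return value_bits + scale_bits / block_size


def mixed_bits_per_weight(quantized_fraction, quantized_bits, fallback_bits=16):
    """Average bits per weight when only part of the model is quantized."""
    return (quantized_fraction * quantized_bits
            + (1 - quantized_fraction) * fallback_bits)


def size_ratio(baseline_bits, actual_bits):
    """How many times smaller the model is than a baseline format."""
    return baseline_bits / actual_bits


nvfp4_bits = effective_bits_per_weight()

# Ideal case: every weight is stored in the 4-bit format.
ideal_vs_fp8 = size_ratio(8, nvfp4_bits)
ideal_vs_fp16 = size_ratio(16, nvfp4_bits)

# Hypothetical case: 95% of weights quantized, the rest kept in FP16.
mixed_bits = mixed_bits_per_weight(0.95, nvfp4_bits)
mixed_vs_fp16 = size_ratio(16, mixed_bits)

print(f"bits per weight: {nvfp4_bits}")
print(f"ideal ratio vs FP8:  {ideal_vs_fp8:.2f}x")
print(f"ideal ratio vs FP16: {ideal_vs_fp16:.2f}x")
print(f"mixed ratio vs FP16: {mixed_vs_fp16:.2f}x")
```

Under these assumptions the ideal ratios come out near 1.78x versus FP8 and 3.56x versus FP16. Keeping some tensors in higher precision, as in the hypothetical mixed case, pulls the whole-model figure down toward the ranges quoted in the text.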
Accuracy recovery improves as model size increases, which makes NVFP4 particularly effective for frontier-scale models and mixture-of-experts architectures.


NVFP4 is NVIDIA's 4-bit floating-point quantization format designed to optimize large language model deployment with minimal accuracy loss. Red Hat has released NVFP4-quantized models ranging from 8B to 400B+ parameters, achieving 95-99% accuracy retention depending on model size. The format provides 1.5-1.8x storage reduction versus FP8 and 3x versus FP16, with native hardware acceleration on NVIDIA Blackwell GPUs. Accuracy recovery improves with larger models, making NVFP4 particularly effective for frontier-scale and mixture-of-experts architectures.

Accelerating Large Language Models with NVFP4 Quantization