NVIDIA NeMo RL introduces an end-to-end FP8 precision recipe for reinforcement learning training of LLMs. The approach applies the block-wise FP8 quantization scheme from DeepSeek-V3 to linear layers in both the generation engine (vLLM) and the training engine (Megatron Core), which reduces numerical disagreement between the two phases. Combined with importance sampling, the end-to-end FP8 recipe matches BF16 accuracy on Llama 3.1 8B and Qwen3-30B while delivering 15–25% throughput gains. Extending FP8 to the KV cache and attention layers adds another ~30% speedup on rollout, yielding an overall ~48% speedup over BF16. Because policy weights change at every RL step, the QKV scales are recalibrated dynamically at each training step, at a cost of only 2–3% overhead. The original post provides configuration examples for NeMo RL.
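To make the block-wise idea concrete, here is a minimal pure-Python sketch of per-block FP8 (E4M3) scaling. It is an illustration only, not NeMo RL, vLLM, or Megatron Core code: the function names, the tiny `block=2` default, and the software rounding routine are all assumptions made for the example (DeepSeek-V3 uses much larger blocks, such as 128x128 for weights).

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def round_to_e4m3(x):
    """Simulate rounding a float to the nearest E4M3 value (3 mantissa bits)."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)
    _, e = math.frexp(mag)        # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)          # -6 is the smallest normal exponent in E4M3
    step = 2.0 ** (exp - 3)       # spacing between representable values here
    q = min(round(mag / step) * step, E4M3_MAX)
    return math.copysign(q, x)


def quantize_blockwise(w, block=2):
    """Quantize a 2D list block by block; returns (quantized values, scales).

    Hypothetical helper: each block x block tile gets its own scale,
    chosen so that the tile's largest magnitude maps to E4M3_MAX.
    """
    rows, cols = len(w), len(w[0])
    q = [[0.0] * cols for _ in range(rows)]
    scales = {}
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            cells = [(r, c)
                     for r in range(r0, min(r0 + block, rows))
                     for c in range(c0, min(c0 + block, cols))]
            amax = max(abs(w[r][c]) for r, c in cells)
            scale = amax / E4M3_MAX if amax > 0 else 1.0
            scales[(r0 // block, c0 // block)] = scale
            for r, c in cells:
                q[r][c] = round_to_e4m3(w[r][c] / scale)
    return q, scales


def dequantize_blockwise(q, scales, block=2):
    """Multiply each quantized value by the scale of the tile it belongs to."""
    return [[q[r][c] * scales[(r // block, c // block)]
             for c in range(len(q[0]))]
            for r in range(len(q))]


# Two 2x2 tiles with very different magnitudes.
weights = [[0.01, -0.02, 100.0, 50.0],
           [0.03, 0.04, -25.0, 10.0]]
quantized, block_scales = quantize_blockwise(weights, block=2)
restored = dequantize_blockwise(quantized, block_scales, block=2)
```

Because every tile carries its own scale, the large values in the right-hand tile do not dictate the scale used for the small values in the left-hand tile.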

8m read time · From developer.nvidia.com
Table of contents
FP8 for linear layers in RL
Results for FP8 Linear Layer E2E
Extending FP8 for KV cache and attention
Try End-to-End FP8 with NVIDIA NeMo RL
Get started
