NVIDIA NeMo RL introduces an end-to-end FP8 precision recipe for reinforcement learning (RL) training of LLMs. The approach applies block-wise quantized FP8 (from DeepSeek-V3) to linear layers in both the generation (vLLM) and training (Megatron Core) engines, reducing numerical disagreement between the two phases. Combined with importance sampling, the end-to-end FP8 recipe matches BF16 accuracy on Llama 3.1 8B and Qwen3-30B while delivering 15–25% throughput gains. Extending FP8 to the KV cache and attention layers adds another ~30% speedup on rollout, yielding an overall ~48% speedup over BF16. Because policy weights change at every RL training step, the QKV scales are dynamically recalibrated at each step, which costs only 2–3% overhead. Configuration examples for NeMo RL are provided.
Table of contents
FP8 for linear layers in RL
Results for FP8 Linear Layer E2E
Extending FP8 for KV cache and attention
Try End-to-End FP8 with NVIDIA NeMo RL
Get started