Run High-Throughput Reinforcement Learning Training with End-to-End FP8 Precision

NVIDIA NeMo RL introduces an end-to-end FP8 precision recipe for reinforcement learning training of LLMs. The approach applies block-wise quantized FP8 (from DeepSeek-V3) to linear layers in both the generation (vLLM) and training (Megatron Core) engines, reducing numerical disagreement between the two phases. Combined with importance sampling, the end-to-end FP8 recipe matches BF16 accuracy on Llama 3.1 8B and Qwen3-30B while delivering 15–25% throughput gains. Extending FP8 to KV cache and attention layers adds another ~30% speedup on rollout, yielding an overall ~48% speedup over BF16. Dynamic QKV scale recalibration at each training step handles the unique challenge of changing policy weights in RL, with only 2–3% overhead. Configuration examples for NeMo RL are provided.
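The importance-sampling correction mentioned above can be pictured with a minimal sketch. It assumes each engine reports per-token log-probabilities for the same sampled tokens, and it weights every token by the truncated ratio of the training-engine probability to the generation-engine probability. The function name `importance_weights` and the `clip_max` value are illustrative assumptions, not NeMo RL's API or defaults.

```python
import math


def importance_weights(train_logprobs, gen_logprobs, clip_max=2.0):
    """Per-token ratio p_train / p_gen, truncated at clip_max.

    train_logprobs: log-probs of the sampled tokens under the training engine.
    gen_logprobs:   log-probs of the same tokens under the generation engine.
    clip_max:       hypothetical truncation bound that limits variance.
    """
    if len(train_logprobs) != len(gen_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    # exp(log p_train - log p_gen) is the probability ratio for each token.
    return [
        min(math.exp(t - g), clip_max)
        for t, g in zip(train_logprobs, gen_logprobs)
    ]


if __name__ == "__main__":
    # Tokens where the two engines agree get weight 1.0; tokens where they
    # disagree are re-weighted, and large ratios are capped at clip_max.
    print(importance_weights([-1.0, -0.5, 0.0], [-1.0, -0.7, -5.0]))
```

When the two engines agree exactly, every weight is 1.0 and the correction has no effect. The weights matter only for tokens where quantization or kernel differences make the generation and training probabilities diverge.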

Apr 20 • 8m read time • From developer.nvidia.com

Table of contents

FP8 for linear layers in RL
Results for FP8 Linear Layer E2E
Extending FP8 for KV cache and attention
Try End-to-End FP8 with NVIDIA NeMo RL
Get started
