Hugging Face introduces co-located vLLM in TRL to solve GPU inefficiency in GRPO training. Previously, training and inference ran on separate GPUs, causing idle time and resource waste. The new approach runs both processes on the same GPUs, achieving up to 1.73× speedup while maintaining model quality. The solution includes vLLM's sleep mode, DeepSpeed optimizations, and Accelerate integration.
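As a rough illustration of what co-location looks like from the user's side, the sketch below enables vLLM-backed generation inside TRL's `GRPOTrainer` on the training GPUs rather than a separate server. It assumes a recent TRL release and a GPU machine; the parameter names (`use_vllm`, `vllm_mode`, `vllm_gpu_memory_utilization`) follow the TRL API but should be checked against your installed version, and the dataset, model, and reward function are placeholders.

```python
# Sketch: co-located vLLM generation in TRL's GRPOTrainer.
# Assumes a recent trl (with vllm extras) and at least one GPU;
# verify option names against your installed TRL version.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions close to 50 characters.
    return [-abs(50 - len(c)) for c in completions]

config = GRPOConfig(
    output_dir="qwen-grpo-colocate",
    use_vllm=True,                    # generate rollouts with vLLM
    vllm_mode="colocate",             # run vLLM on the training GPUs, no separate server
    vllm_gpu_memory_utilization=0.3,  # cap vLLM's share to leave room for training tensors
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_len,
    args=config,
    train_dataset=dataset,
)
trainer.train()
```

Because generation and optimization share the same devices, keeping `vllm_gpu_memory_utilization` low is the key knob: it bounds the KV-cache so the optimizer states and gradients still fit.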
• 14m read time • From huggingface.co
Table of contents

- Introduction
- The Problem
- The Opportunity
  - What It Enables
- Design: From Separate Servers to Shared GPUs
  - Server TRL Setup (Top Row)
  - Co-located TRL Setup (Bottom Row)
- Implementation Notes
- Showcase: Co-located vs. Plain TRL Performance
  - Experiment 1: 1.5B Model - Varying Batch Sizes
  - Experiment 2: 1.5B Model - Varying Tensor Parallelism (TP)
  - Experiment 3: 7B Model - Varying Batch Sizes
  - Experiment 4: 7B Model - Varying Tensor Parallelism (TP)
- Scaling to 72B Model
  - Sleep Mode in vLLM
  - DeepSpeed Optimizations
  - Accelerate Integration
  - Experiment 5: Qwen2.5-Math-72B - Throughput, Accuracy, and Benchmark Results
- Challenges & Lessons Learned & Next Steps
  - Challenges
  - Lessons Learned
- Conclusion
- Give It a Try! (train_grpo_colocate.py)