Hugging Face introduces co-located vLLM in TRL to solve GPU inefficiency in GRPO training. Previously, training and inference ran on separate GPUs, causing idle time and resource waste. The new approach runs both processes on the same GPUs, achieving up to 1.73× speedup while maintaining model quality.

• 14 min read • From huggingface.co
Table of contents

• 🚀 Introduction
• 🧨 The Problem
• 💡 The Opportunity
  • What It Enables
• 🧩 Design: From Separate Servers to Shared GPUs
  • Server TRL Setup (Top Row)
  • Co-located TRL Setup (Bottom Row)
• 🛠️ Implementation Notes
• 📊 Showcase: Co-located vs. Plain TRL Performance
  • Experiment 1: 1.5B Model — Varying Batch Sizes
  • Experiment 2: 1.5B Model — Varying Tensor Parallelism (TP)
  • Experiment 3: 7B Model — Varying Batch Sizes
  • Experiment 4: 7B Model — Varying Tensor Parallelism (TP)
• 📊 Scaling to 72B Model
  • Sleep Mode in vLLM
  • DeepSpeed Optimizations
  • Accelerate Integration
  • Experiment 5: Qwen2.5-Math-72B — Throughput, Accuracy, and Benchmark Results
• 🎓 Challenges & Lessons Learned & Next Steps
  • Challenges
  • Lessons Learned
• ✅ Conclusion
• ✅ Give It a Try!
• 📄 train_grpo_colocate.py
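In TRL, switching from server-based generation to co-located vLLM is a configuration change on the trainer. The sketch below shows a minimal GRPO setup with co-located rollouts; the parameter names (`use_vllm`, `vllm_mode`, `vllm_gpu_memory_utilization`) reflect recent TRL releases and may differ by version, and the toy length-based reward is an assumption for illustration, not the post's reward function.

```python
# Minimal sketch: GRPO training with co-located vLLM in TRL.
# Requires GPUs plus `pip install trl vllm`; check your TRL version's
# GRPOConfig docs, as these argument names may vary across releases.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer


def reward_len(completions, **kwargs):
    # Toy reward for illustration: prefer completions near 50 characters.
    return [-abs(50 - len(c)) for c in completions]


training_args = GRPOConfig(
    output_dir="grpo-colocate",
    use_vllm=True,                     # generate rollouts with vLLM
    vllm_mode="colocate",              # run vLLM inside the training process,
                                       # sharing the same GPUs (no separate server)
    vllm_gpu_memory_utilization=0.3,   # cap vLLM's share of VRAM to leave
                                       # headroom for training tensors
    per_device_train_batch_size=4,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=load_dataset("trl-lib/tldr", split="train"),
)
trainer.train()
```

Because generation and optimization now share the same devices, `vllm_gpu_memory_utilization` is the key knob: set it too high and training runs out of memory, too low and rollout throughput suffers.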
