Hugging Face introduces co-located vLLM in TRL to solve GPU inefficiency in GRPO training. Previously, training and inference ran on separate GPUs, causing idle time and resource waste. The new approach runs both processes on the same GPUs, achieving up to 1.73× speedup while maintaining model quality.

• 14 min read • From huggingface.co
Table of contents

• 🚀 Introduction
• 🧨 The Problem
• 💡 The Opportunity
  • What It Enables
• 🧩 Design: From Separate Servers to Shared GPUs
  • Server TRL Setup (Top Row)
  • Co-located TRL Setup (Bottom Row)
• 🛠️ Implementation Notes
• 📊 Showcase: Co-located vs. Plain TRL Performance
  • Experiment 1: 1.5B Model — Varying Batch Sizes
  • Experiment 2: 1.5B Model — Varying Tensor Parallelism (TP)
  • Experiment 3: 7B Model — Varying Batch Sizes
  • Experiment 4: 7B Model — Varying Tensor Parallelism (TP)
• 📊 Scaling to 72B Model
  • Sleep Mode in vLLM
  • DeepSpeed Optimizations
  • Accelerate Integration
  • Experiment 5: Qwen2.5-Math-72B — Throughput, Accuracy, and Benchmark Results
• 🎓 Challenges & Lessons Learned & Next Steps
  • Challenges
  • Lessons Learned
• ✅ Conclusion
• ✅ Give It a Try!
• 📄 train_grpo_colocate.py
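In TRL, switching from server-based generation to co-located vLLM is a configuration change on the trainer. The sketch below shows a minimal GRPO setup with co-located rollouts; the parameter names (`use_vllm`, `vllm_mode`, `vllm_gpu_memory_utilization`) reflect recent TRL releases and may differ by version, and the toy length-based reward is an assumption for illustration, not the post's reward function.

```python
# Minimal sketch: GRPO training with co-located vLLM in TRL.
# Requires GPUs plus `pip install trl vllm`; check your TRL version's
# GRPOConfig docs, as these argument names may vary across releases.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer


def reward_len(completions, **kwargs):
    # Toy reward for illustration: prefer completions near 50 characters.
    return [-abs(50 - len(c)) for c in completions]


training_args = GRPOConfig(
    output_dir="grpo-colocate",
    use_vllm=True,                     # generate rollouts with vLLM
    vllm_mode="colocate",              # run vLLM inside the training process,
                                       # sharing the same GPUs (no separate server)
    vllm_gpu_memory_utilization=0.3,   # cap vLLM's share of VRAM to leave
                                       # headroom for training tensors
    per_device_train_batch_size=4,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=load_dataset("trl-lib/tldr", split="train"),
)
trainer.train()
```

Because generation and optimization now share the same devices, `vllm_gpu_memory_utilization` is the key knob: set it too high and training runs out of memory, too low and rollout throughput suffers.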
