Online reinforcement learning for LLMs (GRPO, PPO, RLHF) requires multiple cooperating services (policy/actor, rollout engine, reward model, reference model) that must coordinate on every iteration. Slurm's batch scheduling model cannot express this: it lacks service discovery, enforces job time limits, has no component-level health checking, and offers no dynamic scaling. Every major RL framework (OpenRLHF, veRL, NeMo RL, TRL) has independently built the same glue code to bootstrap Ray clusters on top of Slurm, papering over these limitations. Real-world teams including Meta, H Company, and Danijar Hafner's lab have all hit this wall. SkyPilot Job Groups are presented as a solution: a multi-document YAML format where each RL component (trainer, rollout, reward, buffer) is a separate task with its own resource requirements, stable DNS-based service discovery, and independent failure recovery. Slurm remains strong for homogeneous pretraining and SFT batch jobs, but online RL has moved beyond its design space.
Table of contents
What a typical online RL pipeline looks like
Where Slurm’s model breaks down
How every major framework works around it
Teams that hit this wall
Running RL with SkyPilot Job Groups
The case for keeping Slurm
Where this is going