SkyPilot Job Groups is a new feature that lets you define heterogeneous ML workloads in a single YAML file. It solves the coordination problem in RL post-training (GRPO, PPO, RLHF) where different components need different hardware: H100s for the policy trainer, cheaper GPUs for inference rollouts, and high-memory CPUs for

7m read time. From blog.skypilot.co
Table of contents
What you get
The problem
How it works
Full example: 5-component RLHF
Comparison to alternatives
Getting started
Limitations
