SkyPilot Job Groups is a new feature that lets you define heterogeneous ML workloads in a single YAML file. It solves the coordination problem in RL post-training (GRPO, PPO, RLHF), where different components need different hardware: H100s for the policy trainer, cheaper GPUs for inference rollouts, and high-memory CPUs for replay buffers. Key capabilities include automatic provisioning of mixed instance types, DNS-based service discovery between tasks, declarative lifecycle management via primary/auxiliary task designation, and automatic recovery from preemptions. The same YAML runs on AWS, GCP, Azure, or Kubernetes without per-cloud configuration. Limitations include a single-region constraint, a Kubernetes requirement for DNS-based discovery, and replica counts that are fixed at submission time.
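
To make the primary/auxiliary idea concrete, here is a minimal Python sketch of one plausible lifecycle rule. It is not SkyPilot's API or YAML schema: the `Task` class, the `Role` enum, the helper functions, and the hardware labels (including `L4` as the "cheaper GPU") are all hypothetical names chosen for illustration. The assumed rule is that a group counts as done once every primary task has finished, at which point any auxiliary tasks still running can be stopped.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    PRIMARY = "primary"
    AUXILIARY = "auxiliary"


@dataclass
class Task:
    name: str
    accelerator: str  # illustrative hardware label, not a SkyPilot resource spec
    role: Role
    finished: bool = False


def group_is_done(tasks):
    """Assumed rule: the group completes once every primary task has finished."""
    primaries = [t for t in tasks if t.role is Role.PRIMARY]
    return bool(primaries) and all(t.finished for t in primaries)


def tasks_to_stop(tasks):
    """Names of auxiliary tasks still running after the group is done."""
    if not group_is_done(tasks):
        return []
    return [t.name for t in tasks if t.role is Role.AUXILIARY and not t.finished]


# Hypothetical three-component RL post-training group.
group = [
    Task("trainer", "H100", Role.PRIMARY),
    Task("rollout", "L4", Role.AUXILIARY),
    Task("replay-buffer", "cpu-highmem", Role.AUXILIARY),
]
```

In this sketch, marking the trainer as finished makes `tasks_to_stop(group)` return the rollout and replay-buffer tasks, which mirrors the idea that long-running helpers should not outlive the job they serve.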

7m read time, from blog.skypilot.co
Table of contents
What you get
The problem
How it works
Full example: 5-component RLHF
Comparison to alternatives
Getting started
Limitations
