Netflix built an internal post-training framework to scale LLM fine-tuning from experimentation to production. The framework abstracts infrastructure complexity across four dimensions: data (streaming, sequence packing, loss masking), model (sharding, LoRA, architecture support), compute (distributed job orchestration, checkpointing, MFU monitoring), and workflow (supporting both SFT and on-policy RL). Key engineering decisions include staying Hugging Face-compatible for interoperability, maintaining optimized internal model implementations for performance, and evolving from SPMD-only execution to hybrid orchestration for RL workflows. The system enables researchers to focus on modeling rather than distributed systems plumbing.
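To make the data-dimension techniques concrete, here is a minimal sketch of sequence packing combined with loss masking. The function name, the greedy packing strategy, and the use of plain integer token IDs are illustrative assumptions, not Netflix's actual API; it also assumes each example fits within `max_len`.

```python
PAD_ID = 0  # hypothetical padding token ID

def pack_examples(examples, max_len):
    """Greedily pack (prompt, completion) token-ID lists into fixed-length
    sequences. The loss mask is 1 only on completion tokens, so prompt
    tokens and padding do not contribute to the training loss."""
    sequences, cur_ids, cur_mask = [], [], []

    def flush():
        pad = max_len - len(cur_ids)
        sequences.append((cur_ids + [PAD_ID] * pad, cur_mask + [0] * pad))

    for prompt, completion in examples:
        ids = prompt + completion
        mask = [0] * len(prompt) + [1] * len(completion)
        if cur_ids and len(cur_ids) + len(ids) > max_len:
            flush()  # current sequence is full; start a new one
            cur_ids, cur_mask = [], []
        cur_ids = cur_ids + ids
        cur_mask = cur_mask + mask
    if cur_ids:
        flush()
    return sequences
```

Packing amortizes padding waste across many short examples, while the mask keeps the loss well-defined regardless of how examples are concatenated.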

13 min read · From netflixtechblog.com
Table of contents
- Introduction
- A Model Developer's Post-Training Journey
- The Netflix Post-Training Framework
- Learnings from Building the Post-Training Framework
- Wrap up
- Acknowledgements
