Meta describes the challenges of maintaining large-scale AI capacity and how it is transforming its fleet to support the rise of artificial intelligence. The post covers the characteristics of GPU training, what is unique about Meta's GPU training infrastructure, the use of maintenance trains for fleet maintenance, and the role of OpsPlanner in orchestrating disruptive work.

7m read time
From engineering.fb.com
Table of contents
The main characteristics of GPU training
What’s special about Meta’s GPU training?
Maintenance trains
Gradual rollouts
Selecting the correct maintenance domains
OpsPlanner: Meta disruptive-work orchestrator
Safety and failure scenarios
Rapidly moving to the future of generative AI
