Modal manages over 20,000 GPUs across AWS, GCP, Azure, and OCI, and encounters significant reliability and performance differences between cloud providers. Its GPU health system spans instance type benchmarking and selection, machine image preparation with automated testing, boot-time validation, continuous passive monitoring (via DCGM and dmesg), and weekly active healthchecks (DCGM diagnostics, GPUBurn, NCCL tests). Key findings: H100 performance varies dramatically between providers (by up to 50%), as do temperature management (some GPUs reaching 94°C) and ECC error rates. The stakes are high: GPUs accounted for 58.7% of training failures during Meta's LLaMA 3 development, versus just 0.5% for CPUs, highlighting the reliability gap.
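The passive dmesg monitoring mentioned above can be sketched as a scanner for NVIDIA Xid events in the kernel log. This is an illustrative sketch, not Modal's actual implementation: the `FATAL_XIDS` severity map and function names are assumptions, though the `NVRM: Xid` log format and the meaning of the listed codes match NVIDIA's documented behavior.

```python
import re

# The NVIDIA driver reports GPU errors to the kernel log as "Xid" events, e.g.:
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
XID_PATTERN = re.compile(r"NVRM: Xid \((PCI:[0-9a-f:.]+)\): (\d+),")

# Hypothetical severity map: Xid codes that should pull a GPU out of service.
# 48 = double-bit ECC error, 79 = GPU fell off the bus,
# 94/95 = contained/uncontained ECC error.
FATAL_XIDS = {48, 79, 94, 95}

def scan_dmesg(lines):
    """Return (pci_address, xid_code) for every Xid event in dmesg output."""
    events = []
    for line in lines:
        m = XID_PATTERN.search(line)
        if m:
            events.append((m.group(1), int(m.group(2))))
    return events

def unhealthy_gpus(lines):
    """PCI addresses of GPUs that logged a fatal Xid and should be drained."""
    return {pci for pci, xid in scan_dmesg(lines) if xid in FATAL_XIDS}
```

In production a check like this would tail the kernel ring buffer (e.g. `dmesg --follow`) continuously rather than scan a static snapshot, and feed a scheduler that drains workloads off flagged machines.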
Table of contents

- Instance type testing and selection
- Machine images
- Instance boot
- Lifetime management
- Observability
- Support
- Conclusions