Modal manages over 20,000 GPUs across AWS, GCP, Azure, and OCI, and encounters significant reliability and performance differences between cloud providers. Its GPU health system spans instance type benchmarking and selection, machine image preparation with automated testing, boot-time validation, continuous passive monitoring (via DCGM and dmesg), and weekly active healthchecks (DCGM diagnostics, GPUBurn, NCCL tests). Key findings: H100 performance varies dramatically between providers (by up to 50%), as do temperature management (some GPUs reaching 94°C) and ECC error rates. The stakes are high: GPUs accounted for 58.7% of training failures during Meta's LLaMA 3 development, versus just 0.5% for CPUs, highlighting the reliability gap.
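The passive dmesg monitoring mentioned above can be sketched as a scanner for NVIDIA Xid events in the kernel log. This is an illustrative sketch, not Modal's actual implementation: the `FATAL_XIDS` severity map and function names are assumptions, though the `NVRM: Xid` log format and the meaning of the listed codes match NVIDIA's documented behavior.

```python
import re

# The NVIDIA driver reports GPU errors to the kernel log as "Xid" events, e.g.:
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
XID_PATTERN = re.compile(r"NVRM: Xid \((PCI:[0-9a-f:.]+)\): (\d+),")

# Hypothetical severity map: Xid codes that should pull a GPU out of service.
# 48 = double-bit ECC error, 79 = GPU fell off the bus,
# 94/95 = contained/uncontained ECC error.
FATAL_XIDS = {48, 79, 94, 95}

def scan_dmesg(lines):
    """Return (pci_address, xid_code) for every Xid event in dmesg output."""
    events = []
    for line in lines:
        m = XID_PATTERN.search(line)
        if m:
            events.append((m.group(1), int(m.group(2))))
    return events

def unhealthy_gpus(lines):
    """PCI addresses of GPUs that logged a fatal Xid and should be drained."""
    return {pci for pci, xid in scan_dmesg(lines) if xid in FATAL_XIDS}
```

In production a check like this would tail the kernel ring buffer (e.g. `dmesg --follow`) continuously rather than scan a static snapshot, and feed a scheduler that drains workloads off flagged machines.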
Table of contents

- Instance type testing and selection
- Machine images
- Instance boot
- Lifetime management
- Observability
- Support
- Conclusions