How to achieve truly serverless GPUs

Modal's engineering team details four techniques that reduce GPU inference server cold start times from tens of minutes to tens of seconds. The approach combines: (1) cloud buffers of pre-warmed idle GPUs to eliminate instance allocation latency, (2) a custom FUSE-based lazy-loading filesystem with a multi-tier content-addressed cache to cut container start from minutes to seconds, (3) CPU-side checkpoint/restore via gVisor's runsc to skip Python import and library initialization overhead, and (4) CUDA checkpoint/restore using NVIDIA driver support to snapshot and restore GPU memory state. Together these techniques deliver a roughly 40x speedup. The post also shares real-world data from 35M+ CPU snapshot restorations and 15M+ CPU+GPU snapshot restorations over three months, along with benchmark comparisons for vLLM and SGLang serving a 1 GiB model, where mean boot times drop from ~95s to ~14s.
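To make technique (2) more concrete, here is a minimal sketch of lazy file loading backed by a multi-tier content-addressed cache. It is an illustration under assumptions, not Modal's implementation: the names `content_key`, `TieredCache`, and `LazyImage` are hypothetical, the tiers are plain in-memory dictionaries standing in for memory, local disk, and a remote origin, and there is no FUSE layer involved.

```python
import hashlib
from typing import Dict, List


def content_key(data: bytes) -> str:
    """Address a blob by the SHA-256 digest of its bytes."""
    return hashlib.sha256(data).hexdigest()


class TieredCache:
    """Hypothetical cache: look up blobs by content hash, fastest tier first."""

    def __init__(self, tier_names: List[str], origin: Dict[str, bytes]) -> None:
        self.tier_names = tier_names
        self.tiers: List[Dict[str, bytes]] = [{} for _ in tier_names]
        self.origin = origin  # stand-in for slow remote storage
        self.hits: Dict[str, int] = {name: 0 for name in tier_names}
        self.hits["origin"] = 0

    def get(self, key: str) -> bytes:
        for i, tier in enumerate(self.tiers):
            if key in tier:
                self.hits[self.tier_names[i]] += 1
                data = tier[key]
                # Promote the blob into every faster tier for next time.
                for faster in self.tiers[:i]:
                    faster[key] = data
                return data
        # Miss in every tier: fetch from the origin and verify the content hash.
        data = self.origin[key]
        if content_key(data) != key:
            raise ValueError(f"content hash mismatch for {key}")
        self.hits["origin"] += 1
        for tier in self.tiers:
            tier[key] = data
        return data


class LazyImage:
    """Hypothetical image: a manifest of path -> content hash, read on demand."""

    def __init__(self, manifest: Dict[str, str], cache: TieredCache) -> None:
        self.manifest = manifest
        self.cache = cache

    def read(self, path: str) -> bytes:
        # Nothing is fetched until a file is actually read.
        return self.cache.get(self.manifest[path])


if __name__ == "__main__":
    blob = b"model weights"
    key = content_key(blob)
    cache = TieredCache(["memory", "disk"], origin={key: blob})
    # Two paths with identical bytes share one cache entry.
    image = LazyImage({"/models/a.bin": key, "/models/b.bin": key}, cache)
    image.read("/models/a.bin")
    image.read("/models/b.bin")
    print(cache.hits)
```

Because files are keyed by content rather than by path, identical files across images resolve to the same entry, and a container can start before any file bytes have been transferred.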

#serverless

#gpu

#ai-inference

May 12 • 24m read time • From modal.com

Table of contents

Why care about serverless GPUs? To maximize GPU Allocation Utilization for inference workloads.
What’s so hard about serverless GPUs? Startup latency.
You can remove tens of minutes of latency by taking instance allocation and health checks out of the hot path.
You can cut container start from minutes to seconds by serving files lazily out of a content-addressed cache.
You can fast-forward through tens of seconds of application host-side startup with CPU memory snapshotting.
You can fast-forward through minutes of application device startup with GPU memory snapshotting.
We have run this stack at the scale of tens of millions of replicas across many use cases.
Coda