How NetEase Games achieved 30-second LLM cold starts on Kubernetes

NetEase Games describes how it cut LLM cold start times from 42 minutes to under 30 seconds on Kubernetes by adopting Fluid, a CNCF incubating project that adds operational control over data caching for inference workloads. The core problem was that slow model loading undermined serverless GPU autoscaling: pulling 70B-class model weights from remote storage took tens of minutes. Fluid addressed this with Kubernetes-native dataset abstractions, prefetch workflows, cross-namespace model sharing, and cache elasticity via HPA/KEDA. The result was practical elastic inference: GPU resources could be scaled down aggressively during quiet periods, and teams no longer needed to cache redundant copies of shared foundation models.
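To put the two cold start figures in perspective, here is a rough back-of-the-envelope sketch of the average throughput each would imply if the whole cold start were spent moving weights. The 2 bytes per parameter figure (FP16/BF16) is an assumption, not something stated in the article, and real cold starts also include scheduling, container start, and loading into GPU memory, so treat the numbers as illustrative only.

```python
# Back-of-the-envelope: what average throughput do the quoted times imply?
# ASSUMPTION (not from the article): weights stored at 2 bytes per parameter.
# ASSUMPTION: the entire cold start is spent transferring weights.

PARAMS = 70e9            # "70B-class" model, as described in the summary
BYTES_PER_PARAM = 2      # assumed FP16/BF16 precision
weights_bytes = PARAMS * BYTES_PER_PARAM


def implied_throughput_gb_s(size_bytes: float, seconds: float) -> float:
    """Average throughput in GB/s needed to move size_bytes in `seconds`."""
    return size_bytes / seconds / 1e9


slow = implied_throughput_gb_s(weights_bytes, 42 * 60)  # 42-minute cold start
fast = implied_throughput_gb_s(weights_bytes, 30)       # 30-second cold start

print(f"weights: {weights_bytes / 1e9:.0f} GB")
print(f"42 min -> {slow:.3f} GB/s")
print(f"30 s   -> {fast:.2f} GB/s")
print(f"speedup: {fast / slow:.0f}x")
```

Under these assumptions, the gap between the two figures is a factor of 84 (2,520 seconds versus 30), which is the kind of difference one would expect between reading from remote object storage and reading from a warm, node-local cache.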

#tech-news

#kubernetes

#ai-inference

May 21 • 6m read time • From cncf.io

Table of contents

The Day 2 problem: Cold starts, shared models, and fragmented GPU capacity
Why we didn’t just run Alluxio directly
Fluid: Adding operational control to Alluxio
What changed in production
A useful way to frame the choice
