NetEase Games shares how they reduced LLM cold start times from 42 minutes to under 30 seconds on Kubernetes by adopting Fluid, a CNCF incubating project that adds operational control over data caching for inference workloads. The core problem was that serverless GPU autoscaling was undermined by slow model loading — pulling 70B-class model weights from remote storage took tens of minutes. Fluid addressed this by providing Kubernetes-native dataset abstractions, prefetch workflows, cross-namespace model sharing, and cache elasticity via HPA/KEDA. The result was practical elastic inference: GPU resources could be scaled down aggressively during quiet periods, and shared foundation models no longer needed to be cached redundantly per team.
Table of contents
The Day 2 problem: Cold starts, shared models, and fragmented GPU capacityWhy we didn’t just run Alluxio directlyFluid: Adding operational control to AlluxioWhat changed in productionA useful way to frame the choiceSort: