NetEase Games shares how they reduced LLM cold start times from 42 minutes to under 30 seconds on Kubernetes by adopting Fluid, a CNCF incubating project that adds operational control over data caching for inference workloads. The core problem was that serverless GPU autoscaling was undermined by slow model loading — pulling 70B-class model weights from remote storage took tens of minutes. Fluid addressed this by providing Kubernetes-native dataset abstractions, prefetch workflows, cross-namespace model sharing, and cache elasticity via HPA/KEDA. The result was practical elastic inference: GPU resources could be scaled down aggressively during quiet periods, and shared foundation models no longer needed to be cached redundantly per team.
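As a rough illustration of the dataset abstraction and prefetch workflow mentioned above, the sketch below builds three Fluid custom resources as plain Python dictionaries: a `Dataset` pointing at remote model weights, an `AlluxioRuntime` providing the cache, and a `DataLoad` that warms the cache ahead of pod startup. The resource names, namespace, bucket URL, and cache sizing are all hypothetical and not taken from the NetEase deployment. The field layout follows the `data.fluid.io/v1alpha1` API as commonly documented and should be checked against the Fluid version in use.

```python
import json

# All values below are illustrative assumptions, not NetEase's configuration.
DATASET_NAME = "llm-weights"                    # hypothetical dataset name
NAMESPACE = "models"                            # hypothetical namespace
MOUNT_POINT = "s3://example-bucket/models/"     # hypothetical remote storage URL
API_VERSION = "data.fluid.io/v1alpha1"


def dataset_manifest(name, namespace, mount_point):
    """Fluid Dataset: declares where the remote model weights live."""
    return {
        "apiVersion": API_VERSION,
        "kind": "Dataset",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"mounts": [{"mountPoint": mount_point, "name": name}]},
    }


def runtime_manifest(name, namespace, replicas, quota):
    """AlluxioRuntime: cache workers, bound to the Dataset by sharing its name."""
    return {
        "apiVersion": API_VERSION,
        "kind": "AlluxioRuntime",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "replicas": replicas,
            "tieredstore": {
                "levels": [
                    {"mediumtype": "MEM", "path": "/dev/shm", "quota": quota}
                ]
            },
        },
    }


def dataload_manifest(name, namespace, path="/"):
    """DataLoad: prefetches the dataset into the cache before inference pods start."""
    return {
        "apiVersion": API_VERSION,
        "kind": "DataLoad",
        "metadata": {"name": f"{name}-warmup", "namespace": namespace},
        "spec": {
            "dataset": {"name": name, "namespace": namespace},
            "loadMetadata": True,
            "target": [{"path": path, "replicas": 1}],
        },
    }


def build_manifests():
    """Return the three resources in the order they would be applied."""
    return [
        dataset_manifest(DATASET_NAME, NAMESPACE, MOUNT_POINT),
        runtime_manifest(DATASET_NAME, NAMESPACE, replicas=2, quota="64Gi"),
        dataload_manifest(DATASET_NAME, NAMESPACE),
    ]


if __name__ == "__main__":
    # Wrap the resources in a v1 List so the output is a single JSON document.
    bundle = {"apiVersion": "v1", "kind": "List", "items": build_manifests()}
    print(json.dumps(bundle, indent=2))
```

The printed JSON could in principle be piped to `kubectl apply -f -` on a cluster where Fluid is installed. Inference pods would then mount the PersistentVolumeClaim that Fluid creates with the same name as the dataset, so weights are read from the warmed cache instead of remote storage.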

6 min read, from cncf.io
Table of contents
The Day 2 problem: Cold starts, shared models, and fragmented GPU capacity
Why we didn’t just run Alluxio directly
Fluid: Adding operational control to Alluxio
What changed in production
A useful way to frame the choice
