Netflix engineers traced severe container startup stalls to contention on a global mount lock deep in the Linux kernel's VFS layer, not to Kubernetes or containerd. During high-concurrency bursts, thousands of bind mount operations competed for that single kernel lock, causing nodes to freeze for tens of seconds. The team found that CPU architecture matters significantly: older dual-socket NUMA instances with mesh cache coherence suffered far worse than newer single-socket AMD/Intel instances. Mitigations included redesigning overlay filesystem construction so that the number of mount operations per container drops from O(n) in the number of layers to O(1), and routing workloads to hardware architectures that handle global lock contention more gracefully. Disabling hyperthreading improved latency by up to 30% in some configurations. The findings underscore that predictable container scaling under high concurrency requires co-design across container runtimes, kernel internals, and CPU microarchitecture.
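
To see why moving from O(n) to O(1) mount operations per container matters when every mount serializes on one lock, a rough back-of-the-envelope model can help. The sketch below is illustrative only: the function names, the layer count, and the per-operation lock hold time are hypothetical assumptions, not figures or code from the Netflix investigation.

```python
# Back-of-the-envelope model of mount operations serialized on one global lock.
# All names and numbers here are hypothetical, chosen only for illustration.

def mount_ops_per_layer(containers: int, layers: int) -> int:
    """O(n) design: assume one mount operation per image layer per container."""
    return containers * layers


def mount_ops_constant(containers: int, mounts_per_container: int = 1) -> int:
    """O(1) design: assume a fixed number of mounts per container."""
    return containers * mounts_per_container


def serialized_stall_seconds(total_ops: int, lock_hold_us: float) -> float:
    """Worst-case wall time if every operation holds the lock in turn."""
    return total_ops * lock_hold_us / 1_000_000


if __name__ == "__main__":
    containers, layers = 100, 50   # assumed burst size and image depth
    hold_us = 200.0                # assumed lock hold time per mount, in microseconds

    for label, ops in (
        ("per-layer", mount_ops_per_layer(containers, layers)),
        ("constant", mount_ops_constant(containers)),
    ):
        stall = serialized_stall_seconds(ops, hold_us)
        print(f"{label}: {ops} mount ops, about {stall:.2f} s serialized")
```

Under these assumed inputs the per-layer design issues 50 times as many lock acquisitions as the constant design. The model ignores the cache-coherence costs that make contention worse on some CPU architectures, so it should be read as a lower bound on the gap rather than a prediction.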