When we investigated why our Atlantis instance took 30 minutes to restart, we discovered a bottleneck in how Kubernetes handles volume permissions. By adjusting the fsGroupChangePolicy, we reduced restart times to 30 seconds.


Cloudflare

Cloudflare's Atlantis instance (Atlantis is a Terraform automation tool) took 30 minutes to restart because of a default Kubernetes behavior: when a pod mounts a PersistentVolume and its securityContext sets fsGroup, the kubelet recursively changes the group ownership (chgrp) of every file in the volume on every mount. With millions of files accumulated on the PV, that recursive walk became a massive bottleneck. The fix was a single-line change to the pod's securityContext: setting fsGroupChangePolicy to OnRootMismatch instead of the default Always. With OnRootMismatch, the kubelet skips the recursive walk when the ownership and permissions of the volume's root directory already match what is expected. The change cut restart time from 30 minutes to 30 seconds and recovered roughly 600 hours of blocked engineering time per year.
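
For readers who want to see where that one line lives, here is a minimal sketch of a pod manifest with the policy set. It is not Cloudflare's actual configuration: the pod name, fsGroup value, image, mount path, and claim name are all placeholders chosen for illustration. Only the fsGroupChangePolicy field and its OnRootMismatch value come from the summary above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: atlantis-example                  # hypothetical name
spec:
  securityContext:
    fsGroup: 1000                         # hypothetical group ID
    # The one-line change: skip the recursive ownership walk when the
    # volume's root directory already has the expected group and permissions.
    # The default value is "Always".
    fsGroupChangePolicy: "OnRootMismatch"
  containers:
    - name: atlantis
      image: example.invalid/atlantis:placeholder   # placeholder image
      volumeMounts:
        - name: data
          mountPath: /atlantis-data       # hypothetical mount path
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: atlantis-data          # hypothetical PVC name
```

The field sits in the pod-level securityContext next to fsGroup, not in a container-level securityContext. For a workload managed by a StatefulSet or Deployment, it would go under the pod template's spec instead.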

A one-line Kubernetes fix that saved 600 hours a year