A one-line Kubernetes fix that saved 600 hours a year


Cloudflare's Atlantis (a Terraform automation tool) was taking 30 minutes to restart because of a Kubernetes default: when a pod mounts a PersistentVolume with an fsGroup set, kubelet recursively runs chgrp on every file in the volume on every mount. With millions of files accumulated on the PV, this became a massive bottleneck. The fix was a single-line change to the pod's securityContext: setting fsGroupChangePolicy to OnRootMismatch instead of the default Always. That cut restart time from 30 minutes to 30 seconds and recovered roughly 600 hours of blocked engineering time per year.
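To make the difference between the two policies concrete, here is a minimal Python sketch of the decision described above. It is a simplified model, not kubelet's actual code: the function name `files_to_chgrp` is hypothetical, it only lists the paths that would be touched rather than changing ownership, and it compares only the group ID of the volume root, on the assumption that a matching root means the rest of the volume was already fixed on an earlier mount.

```python
import os


def files_to_chgrp(volume_root, fs_group, policy="Always"):
    """Return the paths a mount step would touch (simplified, hypothetical model).

    policy="Always": every path in the volume is visited on every mount.
    policy="OnRootMismatch": the walk is skipped when the volume root
    already belongs to fs_group (assumption: only the gid is compared here).
    """
    if policy == "OnRootMismatch" and os.stat(volume_root).st_gid == fs_group:
        return []  # root already matches, so the recursive walk is skipped

    touched = [volume_root]
    for dirpath, dirnames, filenames in os.walk(volume_root):
        for name in dirnames + filenames:
            touched.append(os.path.join(dirpath, name))
    return touched
```

Under the Always policy the work grows with the number of files on the volume, which is why millions of files translated into a 30-minute mount. Under OnRootMismatch the common case is a single stat of the root directory.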

7m read time. From blog.cloudflare.com
Table of contents
Mysteriously slow restarts
Bad behavior
Going deeper
The missing piece
The fix
