Slack made its Chef infrastructure safer by splitting a single production environment into six isolated buckets (prod-1 through prod-6) mapped to availability zones and rolling changes out to them in stages under a release train model. They built Chef Summoner, a service that triggers Chef runs in response to S3 signals rather than on fixed cron schedules, reducing the blast radius of deployments. The approach avoided a disruptive migration to Policyfiles while still achieving safer deployments. Changes now take longer to propagate, but the delay provides time to catch issues before full rollout. A fallback cron job ensures Chef runs every 12 hours even if Summoner fails, maintaining compliance.
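To make the moving parts concrete, here is a minimal Python sketch of the two decisions described above: when a node should run Chef, and how a new version might move through the six environments. It is an illustration under stated assumptions, not Slack's implementation. The names `NodeState`, `should_run_chef`, and `advance_train` are hypothetical, the S3 signal is reduced to a plain version string, and the one-step-per-tick promotion order is only one possible way to stagger a release train.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# The 12-hour compliance window mentioned in the text.
FALLBACK_INTERVAL_SECONDS = 12 * 60 * 60

# Environment names follow the prod-1..prod-6 naming from the text.
ENVIRONMENTS = [f"prod-{i}" for i in range(1, 7)]


@dataclass
class NodeState:
    environment: str                 # which bucket this node belongs to
    applied_version: Optional[str]   # version the node last converged on
    last_run_ts: float               # Unix time of the last Chef run


def should_run_chef(state: NodeState,
                    signal_version: Optional[str],
                    now: float) -> Optional[str]:
    """Return why Chef should run ("signal" or "fallback"), or None.

    Assumption: the signal for the node's environment has already been
    read (for example from an object in S3) and is passed in as a string.
    """
    # A new version was signalled for this environment: run now.
    if signal_version is not None and signal_version != state.applied_version:
        return "signal"
    # No new signal, but the node has gone too long without a run.
    if now - state.last_run_ts >= FALLBACK_INTERVAL_SECONDS:
        return "fallback"
    return None


def advance_train(signals: Dict[str, Optional[str]],
                  new_version: str) -> Dict[str, Optional[str]]:
    """Move every version one environment along and put new_version first.

    Hypothetical promotion rule: prod-1 receives the newest version, and
    each later environment receives what its predecessor had one tick ago.
    """
    current = [signals.get(env) for env in ENVIRONMENTS]
    shifted = [new_version] + current[:-1]
    return dict(zip(ENVIRONMENTS, shifted))
```

In this sketch a bad version reaches only one bucket per tick, which is where the extra propagation time comes from. The fallback check is folded into the same function only for brevity; as the text notes, the real fallback is a separate cron job, so it still fires if Summoner itself is down.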