Lessons From Our 8 Years Of Kubernetes In Production — Two Major Cluster Crashes, Ditching Self-Managed, Cutting Cluster Costs, Tooling, And More

After eight years of running Kubernetes in production, the main lessons are these: Kubernetes is complex, its certificates must be actively managed, Kubernetes and Helm need to be kept up to date, Helm charts should be maintained centrally, and disaster recovery needs a plan that includes backing up secrets. Other lessons cover the choice between staying vendor-agnostic and "going all in" on one provider, and optimizing node types and cost with reserved instances. Observability through monitoring, alerting, and logging is crucial, as are security measures such as access control and container vulnerability scanning. The company experienced two major cluster crashes due to certificate expirations. Migrating from a self-managed cluster on AWS to managed Kubernetes on Azure (AKS) made the cluster easier to operate, gave access to integrated Azure services, and reduced costs. Overall, Kubernetes has been a game-changer for the company, providing scalability, cost optimization, a better developer experience, and faster time-to-market for new products and services.
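Since both crashes are attributed to expired certificates, a small expiry check may help illustrate what "managing certificates" can look like in practice. The following is a minimal sketch, not the company's actual tooling: the function names and the 30-day threshold are hypothetical, and it assumes the expiry timestamp is already available in the `notAfter` text format that `openssl x509 -enddate -noout` prints (for example `Jan  5 09:34:43 2018 GMT`).

```python
import ssl
import time
from typing import Optional

SECONDS_PER_DAY = 86400
WARN_DAYS = 30  # hypothetical alert threshold, not taken from the article


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Return the number of days between `now` and the certificate's expiry.

    `not_after` uses the OpenSSL text format, e.g. "Jan  5 09:34:43 2018 GMT".
    `now` is a Unix timestamp; it defaults to the current time.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires_at - now) / SECONDS_PER_DAY


def needs_renewal(not_after: str, now: Optional[float] = None,
                  warn_days: int = WARN_DAYS) -> bool:
    """True when the certificate expires in fewer than `warn_days` days."""
    return days_until_expiry(not_after, now) < warn_days
```

A check like this could run on a schedule and feed an alert, so that an approaching expiry is noticed well before the cluster is affected. Passing `now` explicitly keeps the function easy to test.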

#cloud

#startup

#devops

#kubernetes

#cloud-native

#observability