Giant Swarm replaced its custom-built Kubernetes cluster management system with Cluster API, using the AWS provider (CAPA), and live-migrated hundreds of enterprise AWS production clusters without downtime or data loss. The post details the technical mechanics: a CLI-based migration tool, a two-phase process covering custom resource (CR) migration and node transition, and a creative workaround that involved forking HashiCorp Vault to extract root CA signing keys for certificate continuity. Key lessons include the importance of aligning team structure with stated priorities, avoiding premature expansion to new providers before completing the core migration, and the strategic value of adopting upstream open source when a custom solution is no longer differentiating. The migration took years, required company-wide involvement, and ultimately freed engineering capacity for higher-value work.
Table of contents
Where we started: a custom operator stack
Why Cluster API
Choosing to migrate live
The migration mechanics
What we learned
When to replace custom-built with open source
Looking back