Reddit's engineering team migrated its entire Apache Kafka fleet — over 500 brokers and more than a petabyte of live data — from Amazon EC2 to Kubernetes using Strimzi, with zero downtime and no client-side changes. The migration was executed in six phases: introducing a DNS abstraction layer to decouple clients from broker addresses, freeing up broker ID space by reshuffling EC2 brokers, running a mixed EC2/Kubernetes cluster via a forked Strimzi operator, gradually shifting partition leadership and data using Cruise Control, migrating the control plane from ZooKeeper to KRaft, and finally handing off to the standard Strimzi operator. Key lessons include using abstraction layers to decouple clients from infrastructure, treating logical state as the primary asset to protect, and designing every migration step to be reversible.
Table of contents
The Role of Kafka at Reddit
Why Reddit Wanted to Move Away from EC2
The Four Constraints That Shaped the Migration
Phase 1: Taking Control of the Naming Layer
Phase 2: Making Room for New Brokers
Phase 3: Running a Mixed Cluster
Phase 4: Gradually Shifting Data and Traffic
Phase 5: Migrating the Control Plane
Phase 6: Cleaning Up and Handing Off to Standard Strimzi
Conclusion