Agoda processes hundreds of terabytes of Kafka data daily for real-time price updates from suppliers. Standard round-robin partitioning caused over-provisioning due to heterogeneous hardware and uneven message workloads. Static solutions like identical pod deployments and weighted load balancing were rejected as impractical. Instead, Agoda built a dynamic lag-aware system with two components: a lag-aware producer that routes fewer messages to high-lag partitions using Same-Queue Length and Outlier Detection algorithms, and lag-aware consumers that proactively unsubscribe to trigger rebalancing when experiencing high lag, leveraging Kafka 2.4's incremental cooperative rebalance protocol.
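To make the lag-aware producer idea concrete, here is a minimal sketch of lag-weighted partition selection. It is an illustration under stated assumptions, not Agoda's implementation: the function names `same_queue_length_weights` and `pick_partition`, the `base` parameter, and the specific weighting rule (weight equals the shortfall from the largest lag) are all hypothetical. The lag values are hard-coded here, whereas a real producer would need to obtain them from consumer-group offsets.

```python
import random

def same_queue_length_weights(lags, base=1):
    """Hypothetical weighting rule: each partition's weight is how far its
    lag sits below the largest lag, plus a small base so that no partition
    is starved completely."""
    max_lag = max(lags.values())
    return {p: (max_lag - lag) + base for p, lag in lags.items()}

def pick_partition(weights, rng=random):
    """Choose a partition at random, in proportion to its weight."""
    partitions = list(weights)
    return rng.choices(partitions, weights=[weights[p] for p in partitions], k=1)[0]

# Assumed example data: partition 2 is far behind the other two.
lags = {0: 100, 1: 100, 2: 900}
weights = same_queue_length_weights(lags)
print(weights)  # {0: 801, 1: 801, 2: 1}
print(pick_partition(weights))
```

With these weights, partitions 0 and 1 receive nearly all new messages until their lag approaches that of partition 2, which is the "route fewer messages to high-lag partitions" behaviour described above.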
Table of contents
Partitioner and Assignor Strategy
The Over-provisioning Problem At Agoda
The Solution Agoda Didn’t Adopt
Agoda’s Dynamic Lag-Aware Solution
Algorithms for Lag-aware Producer