Agoda processes hundreds of terabytes of Kafka data daily for real-time price updates from suppliers. Standard round-robin partitioning caused over-provisioning due to heterogeneous hardware and uneven message workloads. Static solutions like identical pod deployments and weighted load balancing were rejected as impractical. Instead, Agoda built a dynamic lag-aware system with two components: a lag-aware producer that routes fewer messages to high-lag partitions using Same-Queue Length and Outlier Detection algorithms, and lag-aware consumers that proactively unsubscribe to trigger rebalancing when experiencing high lag, leveraging Kafka 2.4's incremental cooperative rebalance protocol.
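To make the producer-side idea concrete, here is a minimal sketch of one possible reading of "Same-Queue Length": treat each partition's consumer lag as a queue length and hand out the next batch of messages so the queues end up as even as possible. The function name `allocate_same_queue_length`, the `lags` mapping, and the greedy strategy are all illustrative assumptions, not Agoda's actual implementation, and the sketch does not call any Kafka client API.

```python
import heapq


def allocate_same_queue_length(lags, n_messages):
    """Split n_messages across partitions so that queue lengths even out.

    lags: dict mapping partition id -> current consumer lag (hypothetical input;
          a real producer would have to fetch this from lag monitoring).
    Returns a dict mapping partition id -> number of messages to send there.
    """
    # Min-heap keyed on projected queue length (lag + messages allocated so far).
    heap = [(lag, partition) for partition, lag in lags.items()]
    heapq.heapify(heap)
    allocation = {partition: 0 for partition in lags}

    for _ in range(n_messages):
        # Always give the next message to the currently shortest queue.
        length, partition = heapq.heappop(heap)
        allocation[partition] += 1
        heapq.heappush(heap, (length + 1, partition))

    return allocation


if __name__ == "__main__":
    # Partition 0 is badly lagging, partition 1 is idle, partition 2 is in between.
    print(allocate_same_queue_length({0: 10, 1: 0, 2: 4}, 12))
```

With these inputs the high-lag partition 0 receives no new messages, while partitions 1 and 2 are filled until both reach a projected queue length of 8. A plain round-robin partitioner would instead send 4 messages to each partition and leave partition 0 even further behind.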

9 min read. From newsletter.systemdesigncodex.com
Table of contents
Partitioner and Assignor Strategy
The Over-provisioning Problem At Agoda
The Solution Agoda Didn’t Adopt
Agoda’s Dynamic Lag-Aware Solution
Algorithms for Lag-aware Producer
