Trivago runs 50+ Kafka sink services across three regions (US, EU, ASIA) that materialize CDC events into service-local databases. Most sinks sat idle for the majority of the day, yet each consumed ~1 CPU core and 1 GB of RAM, wasting significant cluster capacity. CPU- and memory-based autoscaling proved ineffective because sink workloads are I/O-bound: resource usage stays flat even when Kafka consumer lag grows. The solution was KEDA (Kubernetes Event-Driven Autoscaling) with its Kafka scaler, which uses consumer group lag as the scaling signal and scales deployments all the way down to zero replicas when idle. The key configuration parameters covered are `lagThreshold`, `activationLagThreshold`, `cooldownPeriod`, and `fallback`. An edge case, the cleanup of inactive consumer groups on very low-traffic topics, was solved by adding a Cron scaler that periodically wakes the sinks up. The rollout was gradual: one non-critical sink in one region first, then wider expansion after validation. As a result, average daily consumption dropped from ~50 replica-hours per region to ~1–2 replica-hours.
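To make the roles of `lagThreshold` and `activationLagThreshold` concrete, here is a minimal Python sketch of how consumer lag could translate into a replica count. It is a simplified model, not KEDA's actual implementation: the real decision is made by KEDA together with the Kubernetes Horizontal Pod Autoscaler and also involves `cooldownPeriod`, polling intervals, and `fallback`, none of which are modeled here. The function name, its parameters, and all numbers are illustrative assumptions, not trivago's values.

```python
import math


def desired_replicas(total_lag, lag_threshold, activation_lag_threshold,
                     max_replicas, partitions):
    """Approximate the replica count a lag-based scaler would request.

    Hypothetical helper for illustration only; all parameters are assumptions.
    """
    # At or below the activation threshold the workload stays at zero replicas.
    if total_lag <= activation_lag_threshold:
        return 0
    # Once active, aim for roughly `lag_threshold` messages of lag per replica.
    wanted = math.ceil(total_lag / lag_threshold)
    # Consumers beyond the partition count would sit idle, so cap there as well.
    return max(1, min(wanted, max_replicas, partitions))


if __name__ == "__main__":
    # Illustrative lag values, from an idle sink to a large backlog.
    for lag in (0, 5, 50, 450, 10_000):
        replicas = desired_replicas(lag, lag_threshold=100,
                                    activation_lag_threshold=10,
                                    max_replicas=4, partitions=6)
        print(f"lag={lag} -> replicas={replicas}")
```

In this model an idle sink costs nothing, a small backlog starts a single replica, and a large backlog is capped by the configured maximum. That is the behavior CPU- or memory-based autoscaling cannot provide, because resource usage does not track lag.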
Table of contents
- Introduction / Context
- Background: Current Data Flow
- The Problem: Idle Sinks Burning Resources
- Why Traditional Autoscaling Wasn’t Enough
- Event-Driven Scaling with KEDA
- Before vs After
- Our Solution
- What this gives us in practice
- Edge case: consumer group cleanup for very low-traffic topics
- Migration Path
- Results
- Conclusion