From Always-On to On-Demand: Scaling Kafka Sinks with KED...

Trivago runs 50+ Kafka sink services across three regions (US, EU, ASIA) that materialize CDC events into service-local databases. Most sinks were idle the majority of the day yet consumed ~1 CPU core and 1 GB RAM each, wasting significant cluster capacity. CPU/memory-based autoscaling proved ineffective because sink workloads are I/O-bound and resource usage stays flat even when Kafka consumer lag grows. The solution was KEDA (Kubernetes Event-Driven Autoscaling) with its Kafka scaler, using consumer group lag as the scaling signal to scale deployments all the way down to zero replicas when idle. Key configuration parameters covered include `lagThreshold`, `activationLagThreshold`, `cooldownPeriod`, and `fallback`. An edge case around inactive consumer group cleanup for very low-traffic topics was solved by adding a Cron scaler to periodically wake sinks up. The rollout was gradual: one non-critical sink in one region first, then expanding after validation. Results: average daily consumption dropped from ~50 replica-hours per region to ~1–2 replica-hours.
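As a rough illustration of how the parameters named above fit together, the following Python sketch builds a KEDA `ScaledObject` manifest as a plain dict and models the lag-based replica decision. It is a minimal sketch, not trivago's actual configuration: the function names, threshold values, replica limits, and cron schedule are all assumptions chosen for illustration, and the replica model ignores timing effects such as `cooldownPeriod` and the polling interval.

```python
import math


def build_scaled_object(name, topic, consumer_group, bootstrap_servers):
    """Return a KEDA ScaledObject manifest as a dict (all values are illustrative)."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {"name": name},  # the sink Deployment to scale
            "minReplicaCount": 0,              # allow scale-to-zero when idle
            "maxReplicaCount": 3,              # assumed upper bound
            "cooldownPeriod": 300,             # seconds to wait before scaling to zero
            # If the scaler cannot read lag repeatedly, run a fixed replica count.
            "fallback": {"failureThreshold": 3, "replicas": 1},
            "triggers": [
                {
                    "type": "kafka",
                    "metadata": {
                        "bootstrapServers": bootstrap_servers,
                        "consumerGroup": consumer_group,
                        "topic": topic,
                        "lagThreshold": "100",           # target lag per replica
                        "activationLagThreshold": "10",  # lag needed to leave zero
                    },
                },
                {
                    # Periodic wake-up so the consumer group stays active
                    # on very low-traffic topics (schedule is an assumption).
                    "type": "cron",
                    "metadata": {
                        "timezone": "UTC",
                        "start": "0 3 * * *",
                        "end": "15 3 * * *",
                        "desiredReplicas": "1",
                    },
                },
            ],
        },
    }


def desired_replicas(total_lag, lag_threshold, activation_lag_threshold, max_replicas):
    """Simplified model: stay at zero until lag exceeds the activation
    threshold, then aim for roughly one replica per lag_threshold of lag."""
    if total_lag <= activation_lag_threshold:
        return 0
    return min(max_replicas, max(1, math.ceil(total_lag / lag_threshold)))


if __name__ == "__main__":
    manifest = build_scaled_object(
        "example-sink", "example-topic", "example-sink-group", "kafka:9092"
    )
    print([t["type"] for t in manifest["spec"]["triggers"]])
    for lag in (0, 10, 50, 250, 10_000):
        print(lag, desired_replicas(lag, 100, 10, 3))
```

The split between the two thresholds is the point of the sketch: `activationLagThreshold` decides whether the sink wakes up at all, while `lagThreshold` decides how many replicas it gets once awake.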

#kubernetes

#architecture

#apache-kafka

Feb 18 • 16 min read • From tech.trivago.com

Table of contents

Introduction / Context
Background: Current Data Flow
The Problem: Idle Sinks Burning Resources
Why Traditional Autoscaling Wasn’t Enough
Event-Driven Scaling with KEDA
Before vs After
Our Solution
What this gives us in practice
Edge case: consumer group cleanup for very low-traffic topics
Migration Path
Results
Conclusion
