Zalando's engineering team shares how they reduced Apache Flink state size by 75% (from 240GB to 56GB) by replacing chained Table API SQL joins with a custom DataStream API solution. The root cause was state amplification: each SQL join operator independently stores its own copy of all keyed data in RocksDB, so four chained joins multiply state rather than add it. This caused 12-minute snapshot windows, CPU exhaustion, crash-restart loops, and inflated AWS costs. The fix was a single KeyedProcessFunction called MultiStreamJoinProcessor that unions all input streams and maintains one shared ValueState per SKU, eliminating redundant copies. The approach also enabled built-in deduplication and timestamp-based filtering. Results included 77% faster snapshots, stable CPU at ~30% vs 100% spikes, and 13% AWS cost reduction. The post also notes that Flink 2.1 introduced an experimental MultiJoin operator implementing the same concept natively, but it wasn't available on AWS Managed Flink at the time.
Table of contents
The Initial Architecture: The Attraction of SQLWhy State AmplifiesThe State NightmareFrom Table API to Stream APIResult: 75% State DecreaseSort: