Speeding up Timely Dataflow by 100x

This title could be clearer and more informative.Try out Clickbait Shieldfor free (5 uses left this month).

Frank McSherry demonstrates a 100x speedup in Timely Dataflow by introducing a third operator scheduling mode: 'notify only if holding a capability'. In a benchmark with 1,000 dataflows each containing ~1,000 operators, the naive approach required visiting all ~1,000,000 operators on every tick (similar to Flink's operator-to-operator progress propagation), taking ~350ms per iteration. By allowing REGION and ARRANGE operators to opt out of timestamp notifications when they hold no pending work, the system drops to ~4ms per iteration. The key insight is that Timely Dataflow tracks progress at the system level via timestamp capabilities rather than operator-to-operator communication, enabling it to skip dormant operators entirely. This matters especially for real-world deployments like Materialize's catalog server, which has ~12,000 operators across ~100 dataflows where most are idle at any given moment.

11m read timeFrom materialize.com
Post cover image
Table of contents
The set-up: big dataflowsA reality checkSmartness: tracking progress in timely dataflowOpting out of timestamp progressConclusionsTechnical details and sneaky caveats
1 Comment

Sort: