Databricks rebuilt its rate limiting service to handle orders-of-magnitude more traffic after launching real-time model serving. The original architecture (Envoy → Ratelimit Service → Redis) required two network hops per request, had P99 latencies of 10–20 ms, and made Redis a single point of failure.

The redesign made three coupled changes:

1. Moving counters in-memory behind a sharded routing layer called Dicer, eliminating the Redis hop.
2. Switching to asynchronous batch reporting: clients make local allow/reject decisions and report consumed counts every ~100 ms instead of waiting on a synchronous check.
3. Replacing fixed-window counting with a token bucket, which became practical once in-memory storage made compare-and-set cheap.

The result cut tail latency by roughly 10x and made server load predictable, at the explicit cost of about 5% overshoot between reporting cycles. That overshoot is bounded by three mechanisms: a server-returned rejection rate that clients apply locally, a client-side local rate limiter for extreme spikes, and the token bucket's negative-balance memory, which forces future refills to pay off any deficit. The key insight is that algorithm choice, state storage location, and sync model are three coupled decisions, not independent ones.
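The interplay between the token bucket and asynchronous batch reporting can be illustrated with a small sketch. This is not Databricks' implementation; the class, method names, and parameters are hypothetical. The key detail is that the batch-report path (`report`) deducts tokens that clients have *already* consumed, so the balance can go negative, and refills must first pay off that deficit before new requests are allowed:

```python
import time


class TokenBucket:
    """Token bucket whose balance may go negative (hypothetical sketch).

    The synchronous path (try_acquire) rejects when tokens run out.
    The asynchronous path (report) reconciles counts that clients
    consumed locally between ~100 ms reporting cycles; it can drive
    the balance negative, which bounds overshoot over time.
    """

    def __init__(self, rate: float, capacity: float, now: float = None):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.balance = capacity   # current balance; may go negative
        self.last_refill = now if now is not None else time.monotonic()

    def _refill(self, now: float) -> None:
        # Continuous refill, capped at capacity.
        elapsed = now - self.last_refill
        self.balance = min(self.capacity, self.balance + elapsed * self.rate)
        self.last_refill = now

    def try_acquire(self, tokens: float = 1.0, now: float = None) -> bool:
        """Synchronous check: allow only if enough tokens remain."""
        now = now if now is not None else time.monotonic()
        self._refill(now)
        if self.balance >= tokens:
            self.balance -= tokens
            return True
        return False

    def report(self, tokens: float, now: float = None) -> None:
        """Batch report: deduct tokens clients already spent locally.

        No allow/reject decision here; the deficit it may create is
        the bucket's "memory" of overshoot.
        """
        now = now if now is not None else time.monotonic()
        self._refill(now)
        self.balance -= tokens
```

For example, a bucket with `rate=10, capacity=10` that receives a batch report of 20 consumed tokens drops to a negative balance and rejects all traffic for over a second while refills repay the debt, which is how overshoot in one reporting cycle is clawed back in the next.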
Table of contents
- A Counting Problem
- Moving the Count In-memory
- Optimistic Rate Limiting
- Bounding The Overshoot
- Three Coupled Decisions
- Conclusion