Databricks rebuilt its rate limiting service to handle orders of magnitude more traffic after launching real-time model serving. The original architecture (Envoy → Ratelimit Service → Redis) added two network hops to every request, had P99 latencies of 10–20 ms, and made Redis a single point of failure. The redesign made three coupled changes: moving counters in-memory behind a sharded routing layer called Dicer (eliminating the Redis hop); switching to asynchronous batch reporting, where clients make local allow/reject decisions and report counts every ~100 ms instead of waiting on synchronous checks; and replacing fixed-window counting with a token bucket once in-memory storage made compare-and-set cheap. The result cut tail latency by roughly 10x and made server load predictable, at the explicit cost of up to ~5% overshoot between reporting cycles. Overshoot is bounded by a server-returned rejection rate, a client-side local rate limiter for extreme spikes, and the token bucket's negative-balance memory. The key insight: algorithm choice, state storage location, and sync model are three coupled decisions, not independent ones.
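To make the mechanics concrete, here is a minimal sketch in Go of the two ideas the summary combines: a token bucket whose balance is allowed to go negative when a batch report reveals overshoot, and a client loop that admits requests locally and flushes its counter roughly every 100 ms instead of checking synchronously per request. The `TokenBucket` type, its methods, and the numbers are all hypothetical illustrations, not Databricks' actual API.

```go
// Hypothetical sketch of batch-reported token-bucket rate limiting.
// Not Databricks' implementation; names and parameters are illustrative.
package main

import (
	"fmt"
	"sync"
	"time"
)

// TokenBucket refills at `rate` tokens/sec up to `burst`. The balance may
// go negative when a batch report reveals overshoot, so future requests
// are rejected until the debt is repaid.
type TokenBucket struct {
	mu      sync.Mutex
	rate    float64   // tokens added per second
	burst   float64   // maximum balance
	tokens  float64   // current balance; may be negative after overshoot
	updated time.Time // last refill time
}

func NewTokenBucket(rate, burst float64) *TokenBucket {
	return &TokenBucket{rate: rate, burst: burst, tokens: burst, updated: time.Now()}
}

// refillLocked credits tokens for the time elapsed since the last update.
// Callers must hold b.mu.
func (b *TokenBucket) refillLocked(now time.Time) {
	b.tokens += now.Sub(b.updated).Seconds() * b.rate
	if b.tokens > b.burst {
		b.tokens = b.burst
	}
	b.updated = now
}

// Take deducts n tokens reported by a client batch. Because the deduction
// happens after the fact, the balance can dip below zero; that negative
// balance is the "memory" that bounds overshoot.
func (b *TokenBucket) Take(n float64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.refillLocked(time.Now())
	b.tokens -= n
}

// Allowed reports whether new traffic should be admitted right now.
func (b *TokenBucket) Allowed() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.refillLocked(time.Now())
	return b.tokens > 0
}

func main() {
	bucket := NewTokenBucket(100, 100) // 100 req/s sustained, burst of 100

	// Client side: decide locally, accumulate a counter, and flush it to
	// the (here, in-process) bucket every ~100 ms instead of per request.
	var pending int
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()

	for i := 0; i < 5; i++ {
		<-ticker.C
		pending += 30 // pretend 30 requests were locally admitted this cycle
		bucket.Take(float64(pending))
		pending = 0
		fmt.Printf("cycle %d: allowed=%v\n", i, bucket.Allowed())
	}
}
```

Because usage is deducted in batches, the bucket can briefly be overdrawn: in the run above, admitting 30 requests per cycle against a refill of 10 eventually drives the balance negative, and the debt then suppresses new traffic until the refill rate repays it, which is one of the three overshoot bounds described above.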

From blog.bytebytego.com · 11 min read
Table of contents
- A Counting Problem
- Moving the Count In-memory
- Optimistic Rate Limiting
- Bounding The Overshoot
- Three Coupled Decisions
- Conclusion
