Queue backlogs in distributed systems can be sized and managed with a small set of practical formulas rather than guesswork. The core insight is that a system provisioned exactly for steady-state traffic has zero surplus capacity and will never drain a backlog on its own. The formulas covered include drain time (backlog size / surplus capacity), the headroom formula for sizing consumer fleets against a recovery time objective (RTO), and auto-scaling triggers based on queue growth rate rather than depth alone. The article also explains three critical failure modes: stale message degradation (apply a degradation factor to drain estimates), retry amplification leading to metastable failure states in which recovery generates more load than it resolves, and cascading bottlenecks in multi-stage pipelines, where scaling the wrong stage provides zero benefit. Load shedding via TTL-based message expiry is presented as a cost-effective alternative to over-provisioning. Post-incident measurement of the degradation factor, retry amplification, and actual drain time is recommended to calibrate future estimates.
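
As a rough illustration of the drain-time formula quoted above, here is a minimal Python sketch. The function names, parameter names, and units (messages and messages per second) are hypothetical, not taken from the article. The second function is only one plausible reading of the headroom idea, obtained by rearranging the drain-time formula so that the backlog drains within the RTO; the article's exact headroom formula may differ, and degradation and retry effects are deliberately ignored here.

```python
import math


def drain_time_seconds(backlog, capacity_per_s, arrival_per_s):
    """Drain time = backlog size / surplus capacity.

    Surplus capacity is processing capacity minus the ongoing arrival rate.
    With zero or negative surplus the backlog never drains, so return infinity.
    """
    surplus = capacity_per_s - arrival_per_s
    if surplus <= 0:
        return math.inf
    return backlog / surplus


def required_capacity_per_s(backlog, arrival_per_s, rto_seconds):
    """Capacity needed to drain `backlog` within `rto_seconds`.

    Assumption: derived by solving the drain-time formula for capacity,
    i.e. capacity = arrival rate + backlog / RTO.
    """
    if rto_seconds <= 0:
        raise ValueError("rto_seconds must be positive")
    return arrival_per_s + backlog / rto_seconds


if __name__ == "__main__":
    # 60,000 queued messages, 1,200 msg/s capacity, 1,000 msg/s still arriving.
    print(drain_time_seconds(60_000, 1_200, 1_000))
    # A fleet sized exactly for steady state has no surplus and never drains.
    print(drain_time_seconds(60_000, 1_000, 1_000))
    # Capacity needed to clear the same backlog within a 300-second RTO.
    print(required_capacity_per_s(60_000, 1_000, 300))
```

The sketch makes the summary's core point concrete: when capacity equals the arrival rate, the surplus is zero and the estimated drain time is unbounded, regardless of how small the backlog is.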

19m read time · From infoq.com
Table of contents
Introduction
The Three Numbers That Matter
Little's Law: The One Formula Everyone Should Know
How Backlogs Form and Drain
The Complications That Actually Matter
Cascading Backlogs in Multi-Stage Pipelines
When to Shed Load Instead of Draining
Capacity Planning: Turning Formulas Into Decisions
Caveat: Unprocessable Messages and Dead-Letter Queues
What to Measure and Record
Conclusion
About the Author
