The Mathematics of Backlogs: Capacity Planning for Queue Recovery

Queue backlogs in distributed systems can be planned for with a small set of practical formulas rather than guesswork. The core insight is that a system provisioned exactly for steady-state traffic has zero surplus capacity and will never drain a backlog on its own. The formulas covered are drain time (backlog size / surplus capacity, where surplus capacity is processing rate minus arrival rate), the headroom formula for sizing consumer fleets against a recovery time objective (RTO), and auto-scaling triggers based on queue growth rate rather than depth alone. The article also explains three critical failure modes: stale message degradation, which calls for applying a degradation factor to drain estimates; retry amplification, which can lead to metastable failure states where recovery generates more load than it resolves; and cascading bottlenecks in multi-stage pipelines, where scaling the wrong stage provides zero benefit. Load shedding via TTL-based message expiry is presented as a cost-effective alternative to over-provisioning. Post-incident measurement of the degradation factor, retry amplification, and actual drain time is recommended to calibrate future estimates.
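The drain-time and headroom formulas named in the summary can be sketched in a few lines of Python. This is a minimal illustration, not the article's own code: the function names, the parameter names, and the choice to model the degradation factor as a multiplier on consumer throughput are all assumptions, and the fleet-sizing expression (arrival rate plus backlog divided by the RTO) is one plausible reading of the headroom formula.

```python
import math


def drain_time_seconds(backlog, arrival_rate, processing_rate, degradation=1.0):
    """Estimate seconds to drain a backlog.

    backlog: messages currently queued
    arrival_rate, processing_rate: messages per second
    degradation: assumed multiplier (0..1] on throughput while working
                 through stale messages (hypothetical modelling choice)
    """
    surplus = processing_rate * degradation - arrival_rate
    if surplus <= 0:
        # No surplus capacity: the backlog never drains on its own.
        return math.inf
    return backlog / surplus


def consumers_needed(backlog, arrival_rate, per_consumer_rate, rto_seconds):
    """Consumers required to absorb live traffic and clear the backlog within the RTO."""
    required_rate = arrival_rate + backlog / rto_seconds
    return math.ceil(required_rate / per_consumer_rate)


if __name__ == "__main__":
    # Provisioned exactly for steady state: drain time is infinite.
    print(drain_time_seconds(600_000, 1_000, 1_000))
    # 500 msg/s of surplus clears 600,000 messages in 1,200 seconds.
    print(drain_time_seconds(600_000, 1_000, 1_500))
    # Fleet size for a 10-minute RTO at 100 msg/s per consumer.
    print(consumers_needed(600_000, 1_000, 100, 600))
```

Returning infinity when surplus is zero or negative makes the core insight explicit in code: a fleet sized exactly for steady-state traffic can hold the backlog flat but cannot reduce it.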

#kafka

#distributed-systems

May 13 • 19m read time • From infoq.com

Table of contents

Introduction
The Three Numbers That Matter
Little's Law: The One Formula Everyone Should Know
How Backlogs Form and Drain
The Complications That Actually Matter
Cascading Backlogs in Multi-Stage Pipelines
When to Shed Load Instead of Draining
Capacity Planning: Turning Formulas Into Decisions
Caveat: Unprocessable Messages and Dead-Letter Queues
What to Measure and Record
Conclusion
About the Author
