Building distributed systems is challenging due to their complexity, scale, and unpredictability. Common strategies to increase performance and reliability include queues, retries, caches and more. But be aware! Naively implemented optimizations often have the very opposite effect to what you expect. Not only can they prevent your system from working, they can also hamper your recovery process, which is even worse.
In this session we will explore why you cannot simply add a retry here and a queue there and expect a highly performant and reliable system. We will dive deeper into how to properly implement mechanisms and patterns in distributed systems to reduce failures and make recovery as smooth as possible.

Devoxx

A conference talk covering four key patterns for building reliable distributed systems. The speaker explains why naive retries cause a 'spiral of death' under load and how token bucket mechanisms limit retry-induced overload to ~1% additional traffic. The fallback pattern is examined critically, with real examples like the OpenAI/ChatGPT outage, in which a cache failure cascaded into database overload. Load shedding is presented as a way to prioritize high-value traffic by dropping low-priority requests (bots, free-tier users) early, before requests that will time out anyway waste server resources. Finally, the constant work pattern is introduced as a way to maintain predictable, deterministic load on downstream systems regardless of external event volume, illustrated with AWS EC2 provisioning and DNS update examples.
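
The retry token bucket mentioned above could look roughly like the following Python sketch. It is an illustration under stated assumptions, not the speaker's implementation: the names `RetryBudget` and `call_with_budget`, the deposit of 1 unit per request, the cost of 100 units per retry, and the bucket capacity are all hypothetical choices picked so that retries add at most about 1% extra traffic.

```python
class RetryBudget:
    """Token bucket that caps retries at a fraction of normal traffic.

    Assumption: every first attempt deposits 1 unit and every retry
    costs 100 units, so retries are limited to roughly 1% of requests.
    Integer units are used to avoid floating-point drift.
    """

    def __init__(self, deposit=1, retry_cost=100, capacity=1000):
        self.deposit = deposit
        self.retry_cost = retry_cost
        self.capacity = capacity  # upper bound on saved-up retry credit
        self.tokens = 0           # start empty (hypothetical choice)

    def record_request(self):
        # Called once per incoming request, not per retry.
        self.tokens = min(self.capacity, self.tokens + self.deposit)

    def try_acquire_retry(self):
        # A retry is allowed only if enough credit has accumulated.
        if self.tokens >= self.retry_cost:
            self.tokens -= self.retry_cost
            return True
        return False


def call_with_budget(operation, budget, max_attempts=3):
    """Run operation, retrying only while the shared budget allows it."""
    budget.record_request()
    attempt = 1
    while True:
        try:
            return operation()
        except Exception:
            # Give up when attempts are exhausted or the budget is empty.
            if attempt >= max_attempts or not budget.try_acquire_retry():
                raise
            attempt += 1
```

With these assumed numbers, 1000 requests against a dependency that is completely down produce 1010 calls in total instead of the 3000 that an unconditional three-attempt retry loop would send, so the struggling service sees about 1% additional load rather than triple.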

Distributed Systems - What Can Go Wrong and How to Get It Right by Florian Mair