When Good Locks Go Bad: Diagnosing a System Meltdown Under Load


Engineers at Flipkart diagnosed a critical system failure during load testing for their Big Billion Days sale. Their Mirana service crashed under load due to excessive contention on a Redis distributed lock. Initial solutions using queuing failed because they violated the 'fail fast' principle. The team ultimately solved the problem by implementing an AtomicInteger-based semaphore to limit concurrent threads attempting lock acquisition. The key insight was optimizing for actual service performance (200-300ms per request) rather than downstream resource limits, reducing allowed concurrency from 128 to 5 threads per pod and achieving stable throughput of 90 QPS across 9 pods.
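The summary does not include the team's code, so the following is only a minimal sketch of what an AtomicInteger-based, fail-fast semaphore could look like. The class name `LockAttemptLimiter` and its methods `tryEnter`, `exit`, and `runOrReject` are hypothetical. The limit of 5 is taken from the per-pod figure quoted above. The guarded section stands in for the Redis lock acquisition and the work done under it.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch: caps how many threads may attempt lock acquisition at once.
// Callers over the cap are rejected immediately (fail fast) instead of being queued.
final class LockAttemptLimiter {
    private final int maxConcurrent;
    private final AtomicInteger inFlight = new AtomicInteger(0);

    LockAttemptLimiter(int maxConcurrent) {
        if (maxConcurrent <= 0) {
            throw new IllegalArgumentException("maxConcurrent must be positive");
        }
        this.maxConcurrent = maxConcurrent;
    }

    // Tries to take a permit without blocking. Returns false when the cap is reached.
    boolean tryEnter() {
        while (true) {
            int current = inFlight.get();
            if (current >= maxConcurrent) {
                return false; // at capacity: reject rather than wait
            }
            // The CAS only succeeds if no other thread changed the counter in between.
            if (inFlight.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    // Releases a permit. Must be called exactly once per successful tryEnter().
    void exit() {
        inFlight.decrementAndGet();
    }

    // Runs the guarded section if a permit is free, otherwise returns the fallback.
    <T> T runOrReject(Supplier<T> guarded, Supplier<T> onRejected) {
        if (!tryEnter()) {
            return onRejected.get();
        }
        try {
            return guarded.get();
        } finally {
            exit();
        }
    }
}
```

A caller might create one `new LockAttemptLimiter(5)` per pod and wrap the lock acquisition in `runOrReject`, returning an error or retry hint from the fallback. The compare-and-set loop is used instead of incrementing first and checking afterwards, so the counter never overshoots the cap, even briefly. `java.util.concurrent.Semaphore.tryAcquire()` offers comparable non-blocking behavior out of the box.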

17m read time · From blog.flipkart.tech
Table of contents
Introduction: The Backbone of BBD Readiness
The Investigation: Following the Trail of Clues
The First Attempt: The Queueing Fallacy
Solution Two: Embracing the “Fail Fast” Principle
The Final Twist: When the Math Is Right, but the Logic Is Wrong
Conclusion: The Lessons We Learned
Acknowledgements
