When Good Locks Go Bad: Diagnosing a System Meltdown Under Load


Engineers at Flipkart diagnosed a critical system failure during load testing for their Big Billion Days sale. Their Mirana service crashed under load due to excessive contention on a Redis distributed lock. Initial solutions using queuing failed because they violated the 'fail fast' principle. The team ultimately solved the problem by implementing an AtomicInteger-based semaphore to limit concurrent threads attempting lock acquisition. The key insight was optimizing for actual service performance (200-300ms per request) rather than downstream resource limits, reducing allowed concurrency from 128 to 5 threads per pod and achieving stable throughput of 90 QPS across 9 pods.
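A minimal sketch of the kind of AtomicInteger-based limiter the summary describes, for illustration only. The class name `FailFastLimiter` and its methods are hypothetical and not taken from the original post; the only details drawn from the source are the use of an `AtomicInteger` counter, the fail-fast behaviour (reject rather than queue), and the limit of 5 concurrent threads per pod.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: caps how many threads may attempt the guarded work
// (for example, acquiring a distributed lock) at the same time.
final class FailFastLimiter {
    private final int maxConcurrent;
    private final AtomicInteger inFlight = new AtomicInteger(0);

    FailFastLimiter(int maxConcurrent) {
        if (maxConcurrent <= 0) {
            throw new IllegalArgumentException("maxConcurrent must be positive");
        }
        this.maxConcurrent = maxConcurrent;
    }

    // Claims a slot without blocking; returns false immediately when full.
    boolean tryAcquire() {
        while (true) {
            int current = inFlight.get();
            if (current >= maxConcurrent) {
                return false; // fail fast: no queueing, no waiting
            }
            if (inFlight.compareAndSet(current, current + 1)) {
                return true;
            }
            // Another thread changed the counter first; retry with a fresh value.
        }
    }

    // Frees a slot previously claimed by a successful tryAcquire().
    void release() {
        inFlight.decrementAndGet();
    }

    // Runs the guarded work if a slot is free, otherwise rejects the caller.
    <T> T callOrReject(Callable<T> guarded) throws Exception {
        if (!tryAcquire()) {
            throw new RejectedExecutionException("concurrency limit reached");
        }
        try {
            return guarded.call();
        } finally {
            release();
        }
    }

    int currentInFlight() {
        return inFlight.get();
    }
}
```

Under these assumptions, each pod would hold one `new FailFastLimiter(5)` and wrap the lock-acquiring code path in `callOrReject`, so excess requests are rejected at once instead of piling up behind the lock.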

#performance

#kubernetes

#redis

#distributed-systems

Nov 10, 2025 • 17m read time • From blog.flipkart.tech

Table of contents

Introduction: The Backbone of BBD Readiness
The Investigation: Following the Trail of Clues
The First Attempt: The Queueing Fallacy
Solution Two: Embracing the “Fail Fast” Principle
The Final Twist: When the Math Is Right, but the Logic Is Wrong
Conclusion: The Lessons We Learned
Acknowledgements
