Taming the Storm: Building Groww’s Internal Chaos Engineering Platform

Groww built an internal chaos engineering platform to validate system resilience before high-traffic events like IPO listings. The platform uses a dedicated load-test Kubernetes cluster isolated from production, a traffic replayer that mirrors real production traffic, Argo Workflows as the orchestration control plane, and Chaos Mesh for fault injection. Experiments include Redis latency injection, SQL downtime simulation, pod kills, and network partitions. Observability is handled through Groww's internal platform Olly, which annotates chaos events on shared dashboards. The platform was used to validate IPO-facing services, confirming autoscaling behavior and graceful degradation. Future plans include CI/CD integration, SLO-aware automatic experiment halting, and automated resilience reports.
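As an illustration of what a Redis latency experiment could look like, here is a minimal sketch that builds a Chaos Mesh `NetworkChaos` manifest delaying traffic from an application's pods to Redis pods. The function name, the `load-test` namespace, the `orders-api` and `redis` labels, and the 200 ms / 5 minute values are all hypothetical placeholders, not details from Groww's setup. The summary does not describe how their experiments are actually defined.

```python
import json


def redis_latency_experiment(namespace, app_label, redis_label, latency_ms, duration):
    """Build a Chaos Mesh NetworkChaos manifest (as a dict) that adds
    latency to traffic flowing from application pods to Redis pods.

    All argument values are supplied by the caller; nothing here is
    taken from Groww's real configuration.
    """
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")

    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {
            "name": f"redis-latency-{latency_ms}ms",
            "namespace": namespace,
        },
        "spec": {
            "action": "delay",          # inject network delay
            "mode": "all",              # affect every pod the selector matches
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
            "direction": "to",          # only traffic going to the target
            "target": {
                "mode": "all",
                "selector": {
                    "namespaces": [namespace],
                    "labelSelectors": {"app": redis_label},
                },
            },
            "delay": {"latency": f"{latency_ms}ms"},
            "duration": duration,       # the experiment ends on its own after this
        },
    }


# Hypothetical values for illustration only.
manifest = redis_latency_experiment("load-test", "orders-api", "redis", 200, "5m")
print(json.dumps(manifest, indent=2))
```

The printed JSON could be applied with `kubectl apply -f` or embedded as a step in an orchestration workflow. Keeping `duration` explicit means the fault clears itself even if the orchestrator fails mid-run.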

#kubernetes

#distributed-systems

Mar 11 • 7m read time • From tech.groww.in

Table of contents

Introduction
Goals: What we wanted to achieve
Why We Chose Chaos Mesh
The Architecture
1. Multi-Cluster Setup
2. Traffic-Replayer: The Foundation of Realism
3. Argo Workflows: The Control Plane
4. Chaos Mesh: The Fault Injection Plane
5. Observability with Olly
Validation: IPO Readiness
Future Roadmap
Conclusion: Reliability as a Culture
