Groww built an internal chaos engineering platform to validate system resilience before high-traffic events like IPO listings. The platform uses a dedicated load-test Kubernetes cluster isolated from production, a traffic replayer that mirrors real production traffic, Argo Workflows as the orchestration control plane, and Chaos Mesh for fault injection. Experiments include Redis latency injection, SQL downtime simulation, pod kills, and network partitions. Observability is handled through Groww's internal platform Olly, which annotates chaos events on shared dashboards. The platform was used to validate IPO-facing services, confirming autoscaling behavior and graceful degradation. Future plans include CI/CD integration, SLO-aware automatic experiment halting, and automated resilience reports.

7 min read, from tech.groww.in
Table of contents
Introduction
Goals: What we wanted to achieve
Why We Chose Chaos Mesh
The Architecture
1. Multi-Cluster Setup
2. Traffic-Replayer: The Foundation of Realism
3. Argo Workflows: The Control Plane
4. Chaos Mesh: The Fault Injection Plane
5. Observability with Olly
Validation: IPO Readiness
Future Roadmap
Conclusion: Reliability as a Culture
