Groww's engineering team details how they prepare their trading platform for the 9:15 AM NSE market open. The system relies on four daily checkpoints (5 AM, 8:30–9:15 AM, 3:30 PM, 11:30 PM) and three protection layers: automated SLO alerting, AI anomaly detection, and human monitoring. A custom observability dashboard built on the LGTM stack (Loki, Grafana, Tempo, Mimir) covers six areas: SLO rules, API gateway health, on-call alerts, per-route app monitoring, capacity/memory signals, and consumer lag. Key lessons include treating a 0% SLO reading as a sign that the metrics pipeline has failed rather than as a statement about service health, rejecting alert delay buffers in favor of reducing noise at the source, and deleting alerts that nobody can act on. Upcoming improvements include monitoring the monitoring pipeline itself, intra-day consumer lag alerting, deployment blast radius estimation, and trend-based capacity alerting.
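The lesson about 0% SLO readings can be illustrated with a minimal sketch. It is not Groww's implementation: the names `SloSample`, `Verdict`, and `classify`, the 99.9% target, and the rule that "no samples" or "zero good requests" counts as a pipeline failure are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    HEALTHY = "healthy"
    SLO_BREACH = "slo_breach"
    PIPELINE_FAILURE = "pipeline_failure"


@dataclass
class SloSample:
    """Hypothetical request counts for one SLO evaluation window."""
    good_requests: int
    total_requests: int


def classify(sample: SloSample, target: float = 0.999) -> Verdict:
    """Classify a window; `target` is an assumed SLO objective."""
    # Assumption: a 0% reading (no samples at all, or no good requests
    # recorded) more likely means metrics stopped flowing than that the
    # service is in the state the number suggests, so it is routed as a
    # pipeline problem instead of a service-health signal.
    if sample.total_requests == 0 or sample.good_requests == 0:
        return Verdict.PIPELINE_FAILURE
    ratio = sample.good_requests / sample.total_requests
    return Verdict.HEALTHY if ratio >= target else Verdict.SLO_BREACH


if __name__ == "__main__":
    for s in (SloSample(1000, 1000), SloSample(990, 1000), SloSample(0, 0)):
        print(s, classify(s).value)
```

Separating the pipeline-failure verdict from the breach verdict lets each one page a different owner, which fits the summary's point that a broken pipeline should not be read as a service signal.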
Table of contents
The Moment Everything Has to Be Ready
Three Layers of Protection
Building the Dashboard That Runs the Room
What We Are Building Next
Conclusion