Groww Engineering shares how they tackled alert fatigue in a large-scale SOA by building an SLO-driven observability system. The post explains the distinction between output metrics (symptoms) and input metrics (causes), then describes how SLIs and SLOs are used to derive a single health signal per service. They built an SLO Generator service that periodically queries Prometheus/Mimir metrics, evaluates configurable SLO constraints defined as Helm chart configs in ArgoCD, and emits breach metrics for alerting and dashboards. The design covers scheduler, runner, discovery, and generator components, supports diverse system profiles (Spring, gRPC, Next.js, Kafka, mobile apps, Cloudflare Workers), and scales to thousands of entities. Tradeoffs around metric density and time delay are discussed, along with additional use cases like long-term governance and user-facing status pages.
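The evaluation step described above (read an SLI value, compare it against a configured constraint, emit a breach metric) can be sketched roughly as follows. This is a minimal illustration, not Groww's implementation: the `SloConstraint` fields, the `evaluate` function, the `service_health_breach` key, and the rule that a service is unhealthy when any one constraint is breached are all assumptions made for the example. The `fetch` callable stands in for a Prometheus/Mimir query so the sketch runs without a network.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class SloConstraint:
    """One SLO rule (hypothetical shape; the real config lives in Helm values)."""
    name: str          # label used for the emitted breach metric
    query: str         # expression that yields the SLI value
    threshold: float   # objective the SLI is compared against
    comparator: str    # "lt": healthy when SLI < threshold; "gt": healthy when SLI > threshold


def is_breached(constraint: SloConstraint, value: float) -> bool:
    """Return True when the observed SLI violates the constraint."""
    if constraint.comparator == "lt":
        return not value < constraint.threshold
    if constraint.comparator == "gt":
        return not value > constraint.threshold
    raise ValueError(f"unknown comparator: {constraint.comparator}")


def evaluate(
    constraints: List[SloConstraint],
    fetch: Callable[[str], float],
) -> Dict[str, int]:
    """Evaluate every constraint and derive one health signal for the service.

    Each value is 1 for a breach and 0 otherwise, which maps directly onto a
    gauge that alerts and dashboards can consume.
    """
    breaches = {c.name: int(is_breached(c, fetch(c.query))) for c in constraints}
    # Assumed aggregation rule: the service is breached if any SLO is breached.
    breaches["service_health_breach"] = int(any(breaches.values()))
    return breaches


if __name__ == "__main__":
    # Stand-in for a metrics backend; the query strings and values are made up.
    samples = {"p99_latency_seconds": 0.8, "error_ratio": 0.002}
    rules = [
        SloConstraint("latency_p99", "p99_latency_seconds", 0.5, "lt"),
        SloConstraint("error_rate", "error_ratio", 0.01, "lt"),
    ]
    print(evaluate(rules, samples.__getitem__))
```

In a scheduled setup, a runner would call something like `evaluate` once per service on each tick and publish the resulting values as metrics, so alert rules only need to watch the single per-service signal.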
Table of contents
Symptoms vs Causes
SLO / SLI
SLO Generator Concept
SLO Generator Design
SLO Output
SLO Alerts
Caveats
Further Use Cases
Conclusion