The Signal in the Noise: Scaling Observability with SLO-Driven Alerting

Groww Engineering describes how it tackled alert fatigue in a large-scale service-oriented architecture (SOA) by building an SLO-driven observability system. The post distinguishes output metrics (symptoms) from input metrics (causes), then explains how SLIs and SLOs are used to derive a single health signal per service. The team built an SLO Generator service that periodically queries Prometheus/Mimir metrics, evaluates configurable SLO constraints defined as Helm chart configs in ArgoCD, and emits breach metrics for alerting and dashboards. The design comprises scheduler, runner, discovery, and generator components, supports diverse system profiles (Spring, gRPC, Next.js, Kafka, mobile apps, Cloudflare Workers), and scales to thousands of entities. The post also discusses tradeoffs around metric density and time delay, along with further use cases such as long-term governance and user-facing status pages.
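
The evaluate-and-emit loop summarized above can be illustrated with a minimal Python sketch. It is not Groww's implementation: the `SLOConstraint` type, the `slo_breach` metric name, the label names, and the PromQL strings are all assumptions made for illustration, and the Prometheus/Mimir query is replaced by an injected function so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class SLOConstraint:
    """Hypothetical SLO definition; in the post these live in Helm chart configs."""
    name: str                      # e.g. "availability"
    query: str                     # PromQL expression that yields the SLI value
    threshold: float               # objective the SLI is compared against
    higher_is_better: bool = True  # False for latency-style SLIs


def is_breached(constraint: SLOConstraint, sli: float) -> bool:
    """An SLO is breached when the SLI falls on the wrong side of the threshold."""
    if constraint.higher_is_better:
        return sli < constraint.threshold
    return sli > constraint.threshold


def evaluate_service(
    service: str,
    constraints: List[SLOConstraint],
    run_query: Callable[[str], float],
) -> List[str]:
    """Evaluate every constraint and return breach samples in Prometheus text format."""
    samples = []
    for c in constraints:
        sli = run_query(c.query)  # assumed to return one scalar per query
        breach = 1 if is_breached(c, sli) else 0
        samples.append(f'slo_breach{{service="{service}",slo="{c.name}"}} {breach}')
    return samples


# Illustrative constraints; metric names and thresholds are made up.
constraints = [
    SLOConstraint(
        name="availability",
        query='sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))',
        threshold=0.9995,
    ),
    SLOConstraint(
        name="latency_p99_seconds",
        query="histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
        threshold=0.5,
        higher_is_better=False,
    ),
]

# Stand-in for a real Prometheus/Mimir client: canned SLI values keyed by query.
fake_results = {constraints[0].query: 0.9990, constraints[1].query: 0.35}

lines = evaluate_service("payments", constraints, fake_results.__getitem__)
for line in lines:
    print(line)
```

In this sketch an alert rule would only need to watch the emitted breach series rather than every underlying metric, which is the noise reduction the summary describes. A scheduler would call `evaluate_service` on an interval for each discovered service.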

#devops #observability #opentelemetry #prometheus

May 11 • 10m read time • From tech.groww.in

Table of contents

Symptoms vs Causes
SLO / SLI
SLO Generator Concept
SLO Generator Design
SLO Output
SLO Alerts
Caveats
Further Use Cases
Conclusion
