A two-person SRE team at STCLab shares how they built an AI-powered alert investigation pipeline on Amazon EKS using HolmesGPT, Robusta, and the CNCF observability stack (OpenTelemetry, Mimir, Loki, Tempo). The key finding: runbook quality mattered far more than model selection. Adding exclusion rules to namespace-specific runbooks reduced wasted tool calls from 16 to 2 per investigation and raised investigation quality scores from 3.6 to 4.6 out of 5. The pipeline uses a 200-line Python playbook for Slack thread routing, alert deduplication, and timing. Cost runs about $0.04 per investigation (~$12/month). The team also shares lessons on model migration, hybrid self-hosted/managed API setups, and plans to integrate eBPF-level network metrics via Inspektor Gadget.
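The 200-line playbook itself isn't reproduced in the summary, but the alert-deduplication and Slack-thread-routing behavior it describes can be sketched in plain Python. All names here are hypothetical illustrations; Robusta's actual playbook API and the team's implementation will differ:

```python
import time

class AlertRouter:
    """Hypothetical sketch: dedupe repeated alerts and keep follow-ups
    for the same alert in one Slack thread."""

    def __init__(self, dedup_window_s=300):
        self.dedup_window_s = dedup_window_s
        self.last_seen = {}   # fingerprint -> last-seen timestamp
        self.threads = {}     # fingerprint -> Slack thread id

    def fingerprint(self, alert):
        # Group alerts by name and namespace so repeats collapse together.
        return (alert["alertname"], alert.get("namespace", "default"))

    def should_investigate(self, alert, now=None):
        now = time.time() if now is None else now
        fp = self.fingerprint(alert)
        last = self.last_seen.get(fp)
        self.last_seen[fp] = now
        # Skip duplicates that fire again inside the dedup window.
        return last is None or (now - last) > self.dedup_window_s

    def thread_for(self, alert, new_thread_id):
        # Route follow-ups for the same alert into the existing thread.
        fp = self.fingerprint(alert)
        return self.threads.setdefault(fp, new_thread_id)
```

A router like this would sit in front of the HolmesGPT investigation call, so only the first alert in a burst triggers a (paid) investigation and later results land in the same thread.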

6 min read · From cncf.io
Table of contents:
- Why we built this
- HolmesGPT: Letting the LLM decide what to investigate
- Making it work with Robusta
- Runbooks changed everything
- The model journey
- What actually mattered
