A two-person SRE team at STCLab shares how they built an AI-powered alert investigation pipeline on Amazon EKS using HolmesGPT, Robusta, and the CNCF observability stack (OpenTelemetry, Mimir, Loki, Tempo). The key finding: runbook quality mattered far more than model selection. Adding exclusion rules to namespace-specific runbooks reduced wasted tool calls from 16 to 2 per investigation and raised investigation quality scores from 3.6 to 4.6 out of 5. The pipeline uses a 200-line Python playbook for Slack thread routing, alert deduplication, and timing. Cost runs about $0.04 per investigation (~$12/month). The team also shares lessons on model migration, hybrid self-hosted/managed API setups, and plans to integrate eBPF-level network metrics via Inspektor Gadget.
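The 200-line playbook itself isn't reproduced in the summary, but the alert-deduplication and Slack-thread-routing behavior it describes can be sketched in plain Python. All names here are hypothetical illustrations; Robusta's actual playbook API and the team's implementation will differ:

```python
import time

class AlertRouter:
    """Hypothetical sketch: dedupe repeated alerts and keep follow-ups
    for the same alert in one Slack thread."""

    def __init__(self, dedup_window_s=300):
        self.dedup_window_s = dedup_window_s
        self.last_seen = {}   # fingerprint -> last-seen timestamp
        self.threads = {}     # fingerprint -> Slack thread id

    def fingerprint(self, alert):
        # Group alerts by name and namespace so repeats collapse together.
        return (alert["alertname"], alert.get("namespace", "default"))

    def should_investigate(self, alert, now=None):
        now = time.time() if now is None else now
        fp = self.fingerprint(alert)
        last = self.last_seen.get(fp)
        self.last_seen[fp] = now
        # Skip duplicates that fire again inside the dedup window.
        return last is None or (now - last) > self.dedup_window_s

    def thread_for(self, alert, new_thread_id):
        # Route follow-ups for the same alert into the existing thread.
        fp = self.fingerprint(alert)
        return self.threads.setdefault(fp, new_thread_id)
```

A router like this would sit in front of the HolmesGPT investigation call, so only the first alert in a burst triggers a (paid) investigation and later results land in the same thread.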

6 min read · From cncf.io
Table of contents:
- Why we built this
- HolmesGPT: Letting the LLM decide what to investigate
- Making it work with Robusta
- Runbooks changed everything
- The model journey
- What actually mattered
