Grab built Kinabalu AI SRE, an AI-assisted platform that consolidates fragmented operational context and automates incident diagnostics. The system combines a signal aggregator that ingests data from monitoring tools (Datadog, Kibana, Grafana), a Model Context Protocol toolkit integrating 25 operational tools, and LLM-powered agents that perform auto-triage and root cause analysis. It offers both static diagnostics (comprehensive service health reports from six domain-specific sub-agents) and dynamic chat (real-time queries via Slack/Web UI). The platform aims to reduce time-to-resolution by providing evidence-backed insights, correlating signals across metrics, logs, dependencies, and recent changes, while keeping engineers focused on decision-making rather than context-switching.
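
As a rough illustration of what "correlating signals across metrics, logs, dependencies, and recent changes" could look like, the sketch below groups signals for one service that fall inside a time window before an incident. It is a minimal sketch, not Grab's implementation: the `Signal` type, the `correlate_signals` function, the 30-minute default window, and the source labels are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Signal:
    """Hypothetical normalized record emitted by a signal aggregator."""
    source: str          # assumed labels: "metrics", "logs", "dependencies", "changes"
    service: str
    timestamp: datetime
    summary: str


def correlate_signals(signals, service, incident_time, window=timedelta(minutes=30)):
    """Group one service's signals from the window before the incident, keyed by source.

    Signals inside each group are ordered oldest first so a reader (or an
    LLM agent) can scan them as a timeline.
    """
    start = incident_time - window
    grouped = {}
    for signal in sorted(signals, key=lambda s: s.timestamp):
        in_window = start <= signal.timestamp <= incident_time
        if signal.service == service and in_window:
            grouped.setdefault(signal.source, []).append(signal)
    return grouped


if __name__ == "__main__":
    incident = datetime(2024, 1, 1, 12, 0)
    sample = [
        Signal("changes", "payments", incident - timedelta(minutes=10), "deploy v2"),
        Signal("metrics", "payments", incident - timedelta(minutes=5), "p99 latency spike"),
        Signal("logs", "payments", incident - timedelta(hours=2), "old warning"),
        Signal("metrics", "orders", incident - timedelta(minutes=5), "unrelated service"),
    ]
    for source, items in correlate_signals(sample, "payments", incident).items():
        print(source, [s.summary for s in items])
```

Grouping by source keeps each evidence type separate, which matches the summary's emphasis on evidence-backed insights: a recent deploy and a latency spike in the same window can be presented side by side without the code claiming causation.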

6m read time, from engineering.grab.com
Table of contents
Introduction
Background
A typical user journey
Architecture
Conclusion
Join us
