Scaling Autonomous Site Reliability Engineering: Architecture, Orchestration, and Validation for a 90,000+ Server Fleet

Cloudways built an AI-powered SRE agent called Copilot to handle support and troubleshooting for a fleet of over 90,000 servers and 500,000 applications. The system is built around an Insight Generation Engine that loops through three stages: context gathering, LLM inference via DigitalOcean's Gradient AI Platform (serverless Claude), and server orchestration via Ansible, Redis, and Celery queues. To combat hallucinations at scale, the team uses two validation layers: manual random sampling and an LLM-as-judge secondary evaluator. Key lessons: fine-tuning is rarely needed for SRE tasks, deterministic logic should stay in code, and AI agents must tolerate non-determinism rather than fight it. The platform's Knowledge Bases feature also surfaces relevant documentation alongside AI-generated insights.
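
The two validation layers described above can be sketched roughly as follows. This is a minimal illustration, not the Cloudways implementation: the `Insight` type, the function names, the 0.7 threshold, and especially `toy_judge` (a word-overlap stand-in for a real LLM-as-judge call) are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Insight:
    server_id: str
    context: str  # facts gathered from the server (hypothetical shape)
    text: str     # insight text produced by the primary LLM


def sample_for_manual_review(
    insights: List[Insight], rate: float, seed: Optional[int] = None
) -> List[Insight]:
    """Layer 1: draw a random subset of insights for a human to inspect."""
    if not insights:
        return []
    rng = random.Random(seed)
    k = max(1, round(len(insights) * rate))  # always review at least one
    return rng.sample(insights, min(k, len(insights)))


def judge_insights(
    insights: List[Insight],
    judge: Callable[[Insight], float],
    threshold: float = 0.7,  # assumed cut-off, not from the article
) -> Tuple[List[Insight], List[Insight]]:
    """Layer 2: a secondary evaluator scores each insight in [0, 1]."""
    accepted: List[Insight] = []
    flagged: List[Insight] = []
    for insight in insights:
        if judge(insight) >= threshold:
            accepted.append(insight)
        else:
            flagged.append(insight)
    return accepted, flagged


def toy_judge(insight: Insight) -> float:
    """Stand-in for an LLM judge: fraction of insight words found in context."""
    words = insight.text.lower().split()
    if not words:
        return 0.0
    context_words = set(insight.context.lower().split())
    return sum(word in context_words for word in words) / len(words)


if __name__ == "__main__":
    batch = [
        Insight("srv-1", "disk usage 95 percent on /var", "disk usage 95 percent"),
        Insight("srv-2", "cpu load normal", "mysql replication broken"),
    ]
    accepted, flagged = judge_insights(batch, toy_judge)
    print("accepted:", [i.server_id for i in accepted])
    print("flagged:", [i.server_id for i in flagged])
    print("manual review:", [i.server_id for i in sample_for_manual_review(batch, 0.5, seed=0)])
```

In a real deployment the `judge` callable would wrap a second model call that compares the insight against the gathered context; keeping it as a plain function argument leaves the sampling and thresholding logic deterministic and testable in code.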

#ai-agents

#ansible

#digitalocean

#kubernetes

Mar 13 • 7m read time • From digitalocean.com

Table of contents

How does CW Copilot work?
What we learned during this journey
Why choose DigitalOcean Gradient™ AI Platform
Powering AI Agents at Scale
