Cloudways built an AI-powered SRE agent called Copilot to handle support and troubleshooting for a fleet of over 90,000 servers and 500,000 applications. The system uses an Insight Generation Engine that loops through context gathering, LLM inference via DigitalOcean's Gradient AI Platform (serverless Claude), and server orchestration via Ansible, Redis, and Celery queues. To combat hallucinations at scale, the team relies on two validation layers: manual random sampling and an LLM-as-judge secondary evaluator. Key lessons include: fine-tuning is rarely needed for SRE tasks, deterministic logic should stay in code, and AI agents must tolerate non-determinism rather than fight it. Copilot also uses the Gradient AI Platform's Knowledge Bases feature to surface relevant documentation alongside AI-generated insights.
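To make the two validation layers concrete, here is a minimal Python sketch of how an insight might pass through an LLM-as-judge check and a random-sampling gate. It is not Cloudways' implementation: the `LLM` callable type, `ValidatedInsight`, `validate_insight`, the prompt wording, and the 5% default sample rate are all assumptions made for illustration, and the models are passed in as plain functions so no real API is called.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical abstraction: any function that maps a prompt to model text.
# In a real system this would wrap a serverless inference call.
LLM = Callable[[str], str]


@dataclass
class ValidatedInsight:
    text: str                 # the generated diagnosis
    judge_passed: bool        # verdict from the secondary (judge) model
    sampled_for_review: bool  # True if picked for manual spot-checking


def validate_insight(
    context: str,
    generator: LLM,
    judge: LLM,
    sample_rate: float = 0.05,  # assumed value, not from the article
    rng: Optional[random.Random] = None,
) -> ValidatedInsight:
    """Generate an insight, then apply both validation layers to it."""
    rng = rng or random.Random()

    # Step 1: primary inference over the gathered server context.
    text = generator(f"Server context:\n{context}\n\nDiagnose the issue.")

    # Step 2: LLM-as-judge. A second call grades the first answer.
    verdict = judge(
        "Reply PASS or FAIL. Is the diagnosis supported by the context?\n"
        f"Context:\n{context}\n\nDiagnosis:\n{text}"
    )
    judge_passed = verdict.strip().upper().startswith("PASS")

    # Step 3: independently flag a random fraction for human review.
    sampled = rng.random() < sample_rate

    return ValidatedInsight(text, judge_passed, sampled)


if __name__ == "__main__":
    # Stub models stand in for real inference calls.
    result = validate_insight(
        context="disk usage on /var at 98%",
        generator=lambda prompt: "The /var partition is nearly full.",
        judge=lambda prompt: "PASS",
    )
    print(result)
```

The deterministic parts (parsing the verdict, deciding what gets sampled) stay in ordinary code, while the judge parsing tolerates varied model output by checking only the leading token.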
Table of contents
How does CW Copilot work?
What we learned during this journey
Why choose DigitalOcean Gradient™ AI Platform
Powering AI Agents at Scale