DigitalOcean's Project Poseidon is an internal ML and GenAI system designed to predict hypervisor failures before they occur across its global cloud fleet. It uses a tiered approach: a fast Stage 1 filter combining lightweight ML models with PromQL-based telemetry and a fine-tuned LLM for semantic BMC/SEL log analysis, eliminating ~98% of nodes from further investigation. Flagged nodes proceed to Stage 2 'Deep Collection', where high-resolution time-series data feeds a Hybrid ML model that calculates crash probability scores. The system runs edge-first across 14 data centers with centralized retraining, optimizing for recall to avoid missing at-risk nodes. It is particularly critical for GPU-heavy workloads on H100, Blackwell, and AMD Instinct hardware powering LLM training and inference.
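The two-stage flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Poseidon's actual implementation: the `Node` fields, both thresholds, and the averaging used in place of the Hybrid ML model are hypothetical stand-ins. Only the shape of the pipeline comes from the summary: a cheap, recall-oriented Stage 1 filter followed by a more expensive Stage 2 score computed only for flagged nodes.

```python
from dataclasses import dataclass

# All names and thresholds here are hypothetical, chosen for illustration only.
STAGE1_THRESHOLD = 0.2  # deliberately low: favors recall over precision
STAGE2_THRESHOLD = 0.7  # cutoff on the Stage 2 crash-probability score


@dataclass
class Node:
    name: str
    telemetry_score: float  # stand-in for a lightweight model over metrics
    log_score: float        # stand-in for a model over BMC/SEL log text


def stage1_flag(node: Node) -> bool:
    """Cheap filter: flag the node if either signal looks suspicious."""
    return max(node.telemetry_score, node.log_score) >= STAGE1_THRESHOLD


def stage2_crash_probability(node: Node) -> float:
    """Placeholder for the deeper model: a plain average of the two signals."""
    return (node.telemetry_score + node.log_score) / 2


def triage(fleet: list[Node]) -> dict[str, float]:
    """Score only the nodes that Stage 1 flags; all others are dropped early."""
    return {n.name: stage2_crash_probability(n) for n in fleet if stage1_flag(n)}


fleet = [
    Node("hv-01", 0.05, 0.10),  # healthy on both signals, filtered at Stage 1
    Node("hv-02", 0.60, 0.90),  # suspicious on both signals
    Node("hv-03", 0.25, 0.15),  # marginal: flagged by Stage 1, low Stage 2 score
]
scores = triage(fleet)
at_risk = [name for name, p in scores.items() if p >= STAGE2_THRESHOLD]
print(at_risk)
```

Keeping the Stage 1 threshold low means marginal nodes such as `hv-03` still reach Stage 2, which matches the stated preference for recall: a false alarm costs some extra data collection, while a missed node costs a crash.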
Table of contents
The Challenge of High-Cardinality Telemetry
Architecture Diagram
The Tiered Approach
The Feedback Loop: Continuous Evolution
Design Decisions Worth Naming
Looking Ahead