Beyond the Abyss: Project Poseidon’s Quest for Zero-Downtime Reliability

DigitalOcean's Project Poseidon is an internal ML and GenAI system designed to predict hypervisor failures before they occur across its global cloud fleet. It uses a tiered approach. A fast Stage 1 filter combines lightweight ML models with PromQL-based telemetry and a fine-tuned LLM for semantic analysis of BMC (baseboard management controller) and SEL (system event log) data, eliminating ~98% of nodes from further investigation. Flagged nodes proceed to Stage 2, 'Deep Collection', where high-resolution time-series data feeds a Hybrid ML model that calculates crash probability scores. The system runs edge-first across 14 data centers with centralized retraining, and it optimizes for recall to avoid missing at-risk nodes. It is particularly critical for GPU-heavy workloads on H100, Blackwell, and AMD Instinct hardware powering LLM training and inference.
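The two-stage flow described above can be illustrated with a minimal sketch. It is not DigitalOcean's implementation: `NodeSnapshot`, `threshold_for_recall`, `triage`, and the `deep_score` callback are hypothetical names, and the Stage 1 score is assumed to be a single number in [0, 1]. The sketch only shows the general idea of picking a permissive Stage 1 cutoff that preserves a target recall on historical failures, then running the more expensive scoring on the flagged nodes alone.

```python
import math
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence, Tuple


@dataclass(frozen=True)
class NodeSnapshot:
    """Hypothetical per-node record; field names are illustrative only."""
    node_id: str
    stage1_score: float  # assumed cheap risk score in [0, 1]


def threshold_for_recall(failure_scores: Sequence[float], target_recall: float) -> float:
    """Return the highest cutoff that still flags at least `target_recall`
    of the historical failures whose Stage 1 scores are given."""
    if not failure_scores:
        raise ValueError("need at least one historical failure score")
    if not 0.0 < target_recall <= 1.0:
        raise ValueError("target_recall must be in (0, 1]")
    ranked = sorted(failure_scores, reverse=True)
    needed = math.ceil(target_recall * len(ranked))
    return ranked[needed - 1]


def triage(
    nodes: Iterable[NodeSnapshot],
    stage1_threshold: float,
    deep_score: Callable[[NodeSnapshot], float],
) -> List[Tuple[str, float]]:
    """Stage 1: keep nodes at or above the cutoff.
    Stage 2: score only those nodes with the expensive model (assumed callback)
    and return (node_id, crash_probability) pairs, riskiest first."""
    flagged = [n for n in nodes if n.stage1_score >= stage1_threshold]
    scored = [(n.node_id, deep_score(n)) for n in flagged]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A lower cutoff raises recall at the cost of sending more nodes to Stage 2, which is the trade-off a recall-oriented filter accepts.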

#machine-learning

#llm

Apr 23 • 8m read time • From digitalocean.com

Table of contents

The Challenge of High-Cardinality Telemetry
Architecture Diagram
The Tiered Approach
The Feedback Loop: Continuous Evolution
Design Decisions Worth Naming
Looking Ahead
