“Self-healing” IT? HPE research explores how AI-trained models can catch silent infrastructure failures

HPE has published a whitepaper and research exploring IT-optimized time-series foundational models (IT-TSFM) designed to detect 'gray failures': silent, partial infrastructure degradations that traditional monitoring misses. Unlike generic time-series models, IT-TSFM is trained on infrastructure telemetry to learn IT-specific seasonal patterns and interdependencies, which enables adaptive thresholds in place of static ones. The goal is to move ops teams from reactive to proactive operation, catching anomalies such as zombie services, latency issues, and resource contention before they cause outages that cost $4,000+ per minute. The model is intended to work alongside LLMs and agentic AI, with human operators still in the loop. It was developed by HPE Labs as part of its 60th anniversary.
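To make the contrast between static and adaptive thresholds concrete, here is a minimal sketch. It is not HPE's IT-TSFM, which is a trained foundation model; it swaps in a plain rolling mean and standard deviation as the baseline. The function names, the latency values, the 500 ms limit, and the `window`, `k`, and `min_std` parameters are all illustrative assumptions.

```python
from statistics import mean, stdev

def static_alerts(values, limit):
    """Return indices where the value exceeds a fixed limit."""
    return [i for i, v in enumerate(values) if v > limit]

def adaptive_alerts(values, window=6, k=3.0, min_std=1.0):
    """Return indices where the value deviates more than k standard
    deviations from the mean of the preceding `window` samples.

    min_std is a floor on the spread so that a perfectly flat history
    does not flag every tiny fluctuation.
    """
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        baseline = mean(history)
        spread = max(stdev(history), min_std)
        if abs(values[i] - baseline) > k * spread:
            alerts.append(i)
    return alerts

# Hypothetical latency samples (ms): steady near 100, then a partial
# degradation to about 180 that never crosses the static 500 ms limit.
latency_ms = [100, 102, 99, 101, 100, 98, 101, 100, 180, 185, 190]

print(static_alerts(latency_ms, limit=500))  # []
print(adaptive_alerts(latency_ms))           # [8]
```

The static limit never fires because the degraded latency stays well under 500 ms, while the rolling baseline flags the first degraded sample. It also stops flagging once the degraded values enter its own history window, which shows how a simple rolling baseline can absorb a sustained partial failure.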

#data-science

#observability

#aiops

Mar 11 • 8m read time • From thenewstack.io

Table of contents

Costly risk of unknown unknowns

Specificity needed for gray failures

Will AI just replace the SRE?
