HPE has published a whitepaper and supporting research on IT-optimized time-series foundational models (IT-TSFM), which are designed to detect 'gray failures': silent, partial infrastructure degradations that traditional monitoring misses. Unlike generic time-series models, IT-TSFM is trained on infrastructure telemetry so that it learns IT-specific seasonal patterns and interdependencies, which allows thresholds to adapt to context rather than stay static. The goal is to move ops teams from reactive to proactive work, catching anomalies such as zombie services, latency issues, and resource contention before they cause outages that can cost $4,000 or more per minute. The model is intended to work alongside LLMs and agentic AI, with human operators still in the loop. It was developed by HPE Labs as part of its 60th anniversary.
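To make the contrast between static and adaptive thresholds concrete, here is a minimal sketch in Python. It is not HPE's model and does not use any IT-TSFM API. It stands in a much simpler technique, a per-phase seasonal baseline, and every name in it (`static_anomalies`, `seasonal_anomalies`, `period`, `k`, `floor`, the sample `series`) is a hypothetical chosen for illustration. The synthetic latency series repeats a four-step cycle, and one sample in the quiet part of the cycle is degraded without ever approaching the fixed alert limit.

```python
from statistics import mean, pstdev


def static_anomalies(values, limit):
    """Return indices whose value exceeds one fixed limit."""
    return [i for i, v in enumerate(values) if v > limit]


def seasonal_anomalies(values, period, k=3.0, min_history=3, floor=1.0):
    """Return indices that deviate from earlier samples at the same cycle phase.

    Assumption: the series has a known, fixed seasonal period. A learned model
    would infer such structure; here it is passed in by hand.
    """
    flagged = []
    for i, v in enumerate(values):
        # Earlier samples at the same position in the cycle (i - period, i - 2*period, ...).
        history = values[i % period:i:period]
        if len(history) < min_history:
            continue  # not enough history to form a baseline yet
        baseline = mean(history)
        # Floor the spread so a perfectly flat history does not yield a zero-width band.
        spread = max(pstdev(history), floor)
        if abs(v - baseline) > k * spread:
            flagged.append(i)
    return flagged


# Synthetic latency in ms: a repeating 4-step cycle, five cycles long.
series = [10, 20, 30, 20] * 5
# A "gray" degradation: the quiet phase reads 18 ms instead of 10 ms,
# which is still far below the peak of the normal cycle.
series[16] = 18

print(static_anomalies(series, limit=40))    # → []
print(seasonal_anomalies(series, period=4))  # → [16]
```

The fixed limit of 40 ms never fires because the degraded sample is lower than the normal peak, while lowering that limit enough to catch it would also flag every normal peak. The per-phase baseline flags only the degraded sample because it compares each reading against what is usual for that point in the cycle.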
Table of contents
Costly risk of unknown unknowns
Specificity needed for gray failures
Will AI just replace the SRE?