NVSentinel is an open source monitoring and self-healing system for Kubernetes clusters running GPU workloads. It continuously monitors GPU health using NVIDIA DCGM diagnostics, detects issues like ECC errors and driver crashes, and automatically remediates problems through actions like node quarantine, draining, and triggering
Table of contents
A health system for Kubernetes GPU clustersHow does NVSentinel work?How to get started with NVSentinelGet involved with NVSentinelSort: