etcd failures are a common root cause of Kubernetes control plane degradation, but diagnosing them has historically required deep expertise and manual evidence gathering. The etcd-diagnosis tool addresses this by providing a single command—etcd-diagnosis report—that collects cluster health, disk I/O latency, network RTT between members, and resource pressure signals into a structured diagnostic report. The post covers two major failure modes: database space exhaustion (mvcc: database space exceeded) and slow apply requests, explaining how to identify root causes rather than just applying fixes like compaction. It also distinguishes between quick etcdctl checks for initial triage and full diagnostic reports for deeper investigation. The etcd-recovery tool is presented as a deliberate last resort, with the emphasis on using diagnostics to determine whether recovery is even necessary, allowing automated systems to handle single-member failures when quorum is intact.
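As a rough illustration of the quick-triage idea for the database space exhaustion case, the sketch below reads per-member status data and separates "the database is near its quota" from "most of that space could be reclaimed by defragmentation". It is not part of etcd-diagnosis. The sample data is hypothetical, the JSON shape and the field names dbSize and dbSizeInUse are assumed to match the output of etcdctl endpoint status -w json and should be checked against your etcd version, and the 2 GiB quota is an assumption that should be replaced with your cluster's configured backend quota.

```python
import json

# Hypothetical sample shaped like `etcdctl endpoint status -w json` output.
# The field names (dbSize, dbSizeInUse) are an assumption; verify them
# against the etcd version you run.
SAMPLE = """[
  {"Endpoint": "https://10.0.0.1:2379",
   "Status": {"dbSize": 2040109465, "dbSizeInUse": 515396075}},
  {"Endpoint": "https://10.0.0.2:2379",
   "Status": {"dbSize": 1073741824, "dbSizeInUse": 1020054732}}
]"""

# Assumption: a 2 GiB backend quota. Use your cluster's configured value.
ASSUMED_QUOTA_BYTES = 2 * 1024**3


def triage_space(status_json, quota_bytes=ASSUMED_QUOTA_BYTES):
    """Return per-member quota usage and the share of the file not in use."""
    results = []
    for member in json.loads(status_json):
        status = member["Status"]
        db_size = status["dbSize"]
        # Fall back to db_size if the in-use figure is not reported.
        in_use = status.get("dbSizeInUse", db_size)
        results.append({
            "endpoint": member["Endpoint"],
            # Fraction of the assumed quota taken by the on-disk database.
            "quota_used": db_size / quota_bytes,
            # Fraction of the on-disk file that holds no live data.
            "reclaimable": (db_size - in_use) / db_size if db_size else 0.0,
        })
    return results


for row in triage_space(SAMPLE):
    print(f'{row["endpoint"]}: {row["quota_used"]:.0%} of quota, '
          f'{row["reclaimable"]:.0%} not in use')
```

A member that is close to its quota but has a large share of space not in use points toward fragmentation, whereas high usage with little reclaimable space suggests genuine data growth that needs a deeper look before any fix is applied.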
Table of contents
Diagnosing and Recovering etcd: Practical tools for Kubernetes Operators
Why etcd incidents are so hard to reason about
From symptoms to clarity with etcd-diagnosis
Quick checks vs. deep diagnostics
Understanding common etcd failure modes
Recovery is a last resort, and that’s intentional
Building calmer, more predictable operations
References