Making etcd incidents easier to debug in production Kubernetes

etcd failures are a common root cause of Kubernetes control plane degradation, but diagnosing them has historically required deep expertise and manual evidence gathering. The etcd-diagnosis tool addresses this with a single command, etcd-diagnosis report, which collects cluster health, disk I/O latency, network round-trip time (RTT) between members, and resource pressure signals into a structured diagnostic report. The post covers two major failure modes, database space exhaustion (mvcc: database space exceeded) and slow apply requests, and explains how to identify the root cause rather than only applying a fix such as compaction. It also distinguishes quick etcdctl checks for initial triage from full diagnostic reports for deeper investigation. The etcd-recovery tool is presented as a deliberate last resort: diagnostics should first establish whether recovery is necessary at all, and when quorum is intact, automated systems can be left to handle a single-member failure.
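As a rough illustration of the quick-triage step described above, the sketch below checks two things: whether quorum is still intact, and how close each member's database is to its space quota. It is not taken from the post or from etcd-diagnosis. It assumes the JSON shape printed by etcdctl endpoint status -w json (the Endpoint, Status.dbSize, and Status.dbSizeInUse fields) and a 2 GiB quota, which is etcd's default unless --quota-backend-bytes is set. The sample data and the function names are hypothetical.

```python
import json

# Assumption: etcd's default backend quota (2 GiB) when --quota-backend-bytes is not set.
DEFAULT_QUOTA_BYTES = 2 * 1024**3


def quorum_intact(total_members, healthy_members):
    """Raft can make progress only while a strict majority of members is healthy."""
    return healthy_members >= total_members // 2 + 1


def db_usage(status_json, quota_bytes=DEFAULT_QUOTA_BYTES):
    """Summarize per-endpoint space usage from endpoint-status JSON.

    quota_used  : fraction of the quota consumed by the on-disk database file.
    reclaimable : fraction of the file not logically in use, i.e. space that
                  defragmentation could plausibly hand back.
    """
    results = []
    for entry in json.loads(status_json):
        status = entry["Status"]
        db_size = status["dbSize"]
        # Older releases may omit dbSizeInUse; fall back to "nothing reclaimable".
        in_use = status.get("dbSizeInUse", db_size)
        results.append({
            "endpoint": entry["Endpoint"],
            "quota_used": db_size / quota_bytes,
            "reclaimable": (db_size - in_use) / db_size if db_size else 0.0,
        })
    return results


# Hypothetical sample: a 1.5 GiB database file of which 0.75 GiB is live data.
SAMPLE = json.dumps([
    {
        "Endpoint": "https://10.0.0.1:2379",
        "Status": {"dbSize": 1610612736, "dbSizeInUse": 805306368},
    }
])

if __name__ == "__main__":
    print("quorum intact:", quorum_intact(total_members=3, healthy_members=2))
    for row in db_usage(SAMPLE):
        print(row)
```

A high quota_used together with a high reclaimable fraction suggests fragmentation is a large part of the problem, while a high quota_used with little reclaimable space points to genuine data growth, which is the kind of root-cause distinction the post argues for before reaching for compaction.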

#kubernetes

#distributed-systems

Mar 12 • 5m read time • From cncf.io

Table of contents

Diagnosing and Recovering etcd: Practical tools for Kubernetes Operators
Why etcd incidents are so hard to reason about
From symptoms to clarity with etcd-diagnosis
Quick checks vs. deep diagnostics
Understanding common etcd failure modes
Recovery is a last resort, and that’s intentional
Building calmer, more predictable operations
References
