Kubernetes monitoring differs fundamentally from VM monitoring because pods are ephemeral, IPs are recycled, and metric cardinality can explode. The post covers what to monitor across the cluster, pod, and application layers, then compares six tools: Prometheus+Grafana (free, community-backed, but operationally heavy at scale), Datadog (full-featured but expensive with autoscaling), Last9 (cardinality-focused, OTel-native), Grafana Cloud (managed Prometheus stack), Pixie (eBPF zero-instrumentation), and Kubecost (cost allocation). Key advice: set resource requests and limits, and restrict alerts to OOMKills, pod restart loops, node NotReady, PV usage, and API server latency. For multi-cluster setups, options range from Thanos federation to centralized remote write. Self-hosted Prometheus works well for small setups but becomes its own ops burden beyond two or three clusters.
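To make the short alert list concrete, here is a minimal Python sketch that checks one snapshot of cluster state against those five conditions. The `ClusterSnapshot` fields, the alert names, and every threshold are illustrative assumptions, not values from the post; in practice these signals would come from your metrics backend.

```python
from dataclasses import dataclass


# Hypothetical snapshot of cluster state. Field names are assumptions
# made for this sketch, not metric names from any specific tool.
@dataclass
class ClusterSnapshot:
    oom_kills: int                # containers OOMKilled in the window
    pod_restarts: int             # most restarts seen for one pod in the window
    nodes_not_ready: int          # nodes currently reporting NotReady
    pv_used_fraction: float       # fullest persistent volume, 0.0 to 1.0
    apiserver_p99_seconds: float  # API server request latency, p99


def firing_alerts(s, restart_limit=5, pv_limit=0.85, latency_limit=1.0):
    """Return the names of the alerts that fire for this snapshot.

    The default limits are placeholder assumptions; tune them per cluster.
    """
    checks = [
        ("OOMKill", s.oom_kills > 0),
        ("PodRestartLoop", s.pod_restarts >= restart_limit),
        ("NodeNotReady", s.nodes_not_ready > 0),
        ("PVUsageHigh", s.pv_used_fraction >= pv_limit),
        ("APIServerLatencyHigh", s.apiserver_p99_seconds >= latency_limit),
    ]
    return [name for name, failed in checks if failed]
```

Calling `firing_alerts(ClusterSnapshot(1, 0, 0, 0.9, 0.2))` would flag only the OOMKill and PV usage conditions. Keeping the list this small is the point: anything outside it goes to a dashboard rather than a pager.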

17m read time · From last9.io