What actually works for Kubernetes monitoring at scale — not what looks good in a vendor demo with a five-pod cluster.

Last9 is a blog focusing on DevOps practices, cloud architecture, and software engineering methodologies. Through insightful articles, tutorials, and case studies, Last9 addresses various aspects of modern software development, including continuous integration and continuous delivery (CI/CD), infrastructure as code (IaC), containerization, and microservices architecture. By sharing best practices, real-world experiences, and expert insights, Last9 equips developers, DevOps engineers, and IT professionals with the knowledge and tools needed to build, deploy, and manage resilient and scalable cloud-native applications.

Last9

Kubernetes monitoring differs fundamentally from VM monitoring because pods are ephemeral, IPs are recycled, and label cardinality can explode. The post covers what to monitor across the cluster, pod, and application layers, then compares six tools: Prometheus+Grafana (free, community-backed, but operationally heavy at scale), Datadog (full-featured but expensive with autoscaling), Last9 (cardinality-focused, OTel-native), Grafana Cloud (managed Prometheus stack), Pixie (eBPF zero-instrumentation), and Kubecost (cost allocation). Key advice: set resource requests and limits, and alert only on OOMKills, pod restart loops, node NotReady, PV usage, and API server latency. For multi-cluster setups, options range from Thanos federation to centralized remote write. Self-hosted Prometheus works well for small setups but becomes its own ops burden beyond two or three clusters.
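To make three of the alert conditions in the summary concrete (OOMKills, pod restart loops, node NotReady), here is a minimal Python sketch. It assumes you already have Pod and Node objects parsed from the Kubernetes API's JSON into plain dicts. The function names and the restart threshold are hypothetical and chosen for illustration only. The field names (`containerStatuses`, `lastState.terminated.reason`, `restartCount`, `conditions`) follow the Kubernetes status objects.

```python
# Assumed threshold for illustration; tune per workload.
# restartCount is cumulative, so a production alert would more likely
# look at the increase over a time window rather than the raw total.
RESTART_LOOP_THRESHOLD = 5


def pod_alerts(pod: dict) -> list:
    """Return alert strings for one Pod object (hypothetical helper)."""
    alerts = []
    name = pod.get("metadata", {}).get("name", "<unknown>")
    for cs in pod.get("status", {}).get("containerStatuses", []):
        container = cs.get("name", "<container>")
        # lastState.terminated is only present if the container exited before.
        last = cs.get("lastState", {}).get("terminated") or {}
        if last.get("reason") == "OOMKilled":
            alerts.append(f"{name}/{container}: OOMKilled")
        restarts = cs.get("restartCount", 0)
        if restarts >= RESTART_LOOP_THRESHOLD:
            alerts.append(f"{name}/{container}: restart loop ({restarts} restarts)")
    return alerts


def node_alerts(node: dict) -> list:
    """Return alert strings for one Node object (hypothetical helper)."""
    alerts = []
    name = node.get("metadata", {}).get("name", "<unknown>")
    for cond in node.get("status", {}).get("conditions", []):
        # The Ready condition is "True" when healthy; "False" and "Unknown"
        # are both treated as NotReady here.
        if cond.get("type") == "Ready" and cond.get("status") != "True":
            alerts.append(f"{name}: NotReady")
    return alerts
```

In practice these checks would usually be written as alerting rules in whichever monitoring tool you run, not as application code. The sketch only shows which status fields the conditions map to.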

Kubernetes Monitoring Tools: What Actually Works at Scale