Best of Monitoring — January 2026

1
Article
Last9·18w
Logs vs Metrics: A Practical Guide for Engineers
Metrics and logs serve complementary purposes in production systems. Metrics provide fast, cheap aggregated data for alerting and trend analysis, showing that something is wrong. Logs offer detailed event records for debugging and auditing, revealing what specifically went wrong. The practical workflow combines both: metrics alert you to problems, dashboards confirm patterns, and logs explain root causes. Start with the four golden signals (latency, traffic, errors, saturation) as metrics, use structured JSON logging strategically at service boundaries and for errors, and connect both with request IDs for effective troubleshooting.
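As a rough illustration of that workflow, the sketch below counts requests and errors as metrics and emits a structured JSON log line only for errors, tagged with a request ID. The logger name `checkout`, the metric names, and the `handle_request` helper are hypothetical choices for this example, not taken from the article.

```python
import collections
import io
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # request_id arrives through the `extra` argument of the log call
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


def build_logger(stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, metrics, fail=False):
    """Count every request as a metric; log details only when it fails."""
    request_id = str(uuid.uuid4())
    metrics["requests_total"] += 1
    if fail:
        metrics["errors_total"] += 1
        logger.error("payment declined", extra={"request_id": request_id})
    return request_id


buf = io.StringIO()
logger = build_logger(buf)
metrics = collections.Counter()  # stand-in for a real metrics client
failed_id = handle_request(logger, metrics, fail=True)
handle_request(logger, metrics)
print(dict(metrics))
print(buf.getvalue().strip())
```

The counters stay cheap enough to alert on, while the request ID in the log line is what lets an engineer jump from an error spike to the specific failing request.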
2
Article
The Apache Software Foundation Blog·20w
The Apache Software Foundation Announces New Top-Level Projects
The Apache Software Foundation has promoted three projects to Top-Level Project status: HertzBeat (an AI-powered observability platform for monitoring and alerting), Teaclave (a secure computing platform using Trusted Execution Environments with Rust-based SDKs), and Training (a repository of open source educational materials for Apache projects). These promotions recognize mature communities that have adopted The Apache Way and demonstrate the foundation's commitment to sustainable open source development.
3
Article
Datadog·17w
Debug PostgreSQL query latency faster with EXPLAIN ANALYZE in Datadog Database Monitoring
Datadog Database Monitoring now automatically collects PostgreSQL EXPLAIN ANALYZE execution plans to help troubleshoot slow queries. The feature processes plans captured by PostgreSQL's auto_explain extension, correlates them with APM traces, and provides interactive visualizations. Key use cases include identifying incorrect row estimates that cause inefficient join strategies, and analyzing cache hits versus disk reads to determine whether performance issues stem from I/O bottlenecks or query optimization needs.
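To make the two use cases concrete, here is a minimal sketch that inspects a plan in the JSON shape produced by PostgreSQL's `EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)`. It is not Datadog's implementation. The sample plan, the `misestimates` and `cache_hit_ratio` helpers, and the 10x threshold are assumptions made for illustration.

```python
import json


def walk(node):
    """Yield a plan node and all of its descendants."""
    yield node
    for child in node.get("Plans", []):
        yield from walk(child)


def misestimates(plan_json, factor=10.0):
    """Return nodes whose estimated and actual row counts differ by >= factor."""
    root = json.loads(plan_json)[0]["Plan"]
    flagged = []
    for node in walk(root):
        estimated = max(node["Plan Rows"], 1)  # avoid division by zero
        actual = max(node["Actual Rows"], 1)
        ratio = max(estimated / actual, actual / estimated)
        if ratio >= factor:
            flagged.append((node["Node Type"], node["Plan Rows"], node["Actual Rows"]))
    return flagged


def cache_hit_ratio(plan_json):
    """Share of shared-buffer block accesses served from cache, or None."""
    root = json.loads(plan_json)[0]["Plan"]
    # Buffer counts on an upper-level node include those of its children,
    # so the root node carries the totals for the whole query.
    hit = root.get("Shared Hit Blocks", 0)
    read = root.get("Shared Read Blocks", 0)
    total = hit + read
    return hit / total if total else None


# Hypothetical plan, trimmed to the fields used above.
SAMPLE_PLAN = json.dumps([{
    "Plan": {
        "Node Type": "Nested Loop", "Plan Rows": 10, "Actual Rows": 5000,
        "Shared Hit Blocks": 80, "Shared Read Blocks": 20,
        "Plans": [
            {"Node Type": "Seq Scan", "Plan Rows": 100, "Actual Rows": 120},
            {"Node Type": "Index Scan", "Plan Rows": 1, "Actual Rows": 50},
        ],
    }
}])

print(misestimates(SAMPLE_PLAN))
print(cache_hit_ratio(SAMPLE_PLAN))
```

A large gap between estimated and actual rows points toward stale statistics or a poor join choice, while a low hit ratio suggests the query is bound by disk reads rather than by plan shape.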
4
Article
Modal·19w
Keeping 20,000 GPUs healthy
Modal manages over 20,000 GPUs across AWS, GCP, Azure, and OCI, encountering significant reliability and performance differences between cloud providers. Their GPU health system includes instance type benchmarking and selection, machine image preparation with automated testing, boot-time validation, and continuous passive monitoring (via DCGM and dmesg) plus weekly active healthchecks (DCGM diag, GPUBurn, NCCL tests). Key findings: Cloud providers vary dramatically in H100 performance (up to 50% differences), temperature management (some reaching 94°C), and ECC error rates. GPUs account for 58.7% of training failures in Meta's LLaMA 3 development, compared to just 0.5% for CPUs, highlighting the reliability gap.
5
Article
PostgreSQL·18w
pgmetrics 1.19 released
pgmetrics version 1.19 has been released. This open-source tool collects 350+ metrics from PostgreSQL servers and displays them in text format or exports as JSON/CSV. It supports managed PostgreSQL services (AWS, Azure, GCP) and works with Citus, PgBouncer, and Pgpool. The tool is zero-dependency and comes as a single binary.
6
Article
InfluxData·19w
InfluxData
Telegraf 1.37 introduces new plugins for querying Grafana Loki (LogQL) and Prometheus (PromQL), monitoring NFTables, and collecting system time metrics. New outputs include ARC-DB and heartbeat endpoints, and new secret stores cover Google Cloud and HashiCorp Vault. The release also announces that strict environment variable handling will become the default in v1.38.0, so users should verify their configurations before upgrading. Additional improvements include IP filtering for socket listeners, removal of deprecated options, and persistent self-signed certificates for OPCUA plugins.
