Best of Monitoring: January 2026

  1.
    Article
    Last9 · 18w

    Logs vs Metrics: A Practical Guide for Engineers

    Metrics and logs serve complementary purposes in production systems. Metrics provide fast, cheap aggregated data for alerting and trend analysis, showing that something is wrong. Logs offer detailed event records for debugging and auditing, revealing what specifically went wrong. The practical workflow combines both: metrics alert you to problems, dashboards confirm patterns, and logs explain root causes. Start with the four golden signals (latency, traffic, errors, saturation) as metrics, use structured JSON logging strategically at service boundaries and for errors, and connect both with request IDs for effective troubleshooting.
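    The pairing described above (a cheap counter for alerting, a structured JSON log line for the root cause, and a request ID tying them together) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the article's own code: the `checkout` logger name, the `REQUEST_COUNTS` counter, and the `handle_request` helper are hypothetical names chosen for the example, and a real service would send the counts to a metrics client rather than keep them in memory.

```python
import json
import logging
import uuid
from collections import Counter

# Hypothetical in-process metric store (assumption for illustration only);
# a real service would export these counts through a metrics client.
REQUEST_COUNTS = Counter()


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # request_id is attached via the `extra` argument at the call site.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


def build_logger(stream=None):
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, work, request_id=None):
    """Run `work`, count the outcome as a metric, and log details only on error."""
    request_id = request_id or uuid.uuid4().hex
    try:
        result = work()
    except Exception as exc:
        # Metric: cheap and aggregated, tells you *that* something is wrong.
        REQUEST_COUNTS["error"] += 1
        # Log: detailed, tells you *what* went wrong for this specific request.
        logger.error("request failed: %s", exc, extra={"request_id": request_id})
        return None
    REQUEST_COUNTS["ok"] += 1
    return result
```

    With this shape, an alert on the error counter points to a time window, and filtering the JSON logs by `request_id` recovers the individual failing requests inside it.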

  2.
    Article
    The Apache Software Foundation Blog · 20w

    The Apache Software Foundation Announces New Top-Level Projects

    The Apache Software Foundation has promoted three projects to Top-Level Project status: HertzBeat (an AI-powered observability platform for monitoring and alerting), Teaclave (a secure computing platform using Trusted Execution Environments with Rust-based SDKs), and Training (a repository of open source educational materials for Apache projects). These promotions recognize mature communities that have adopted The Apache Way and demonstrate the foundation's commitment to sustainable open source development.

  3.
    Article
    Datadog · 17w

    Debug PostgreSQL query latency faster with EXPLAIN ANALYZE in Datadog Database Monitoring

    Datadog Database Monitoring now automatically collects PostgreSQL EXPLAIN ANALYZE execution plans to help troubleshoot slow queries. The feature processes plans captured by PostgreSQL's auto_explain extension, correlates them with APM traces, and provides interactive visualizations. Key use cases include identifying incorrect row estimates that cause inefficient join strategies, and analyzing cache hits versus disk reads to determine whether performance issues stem from I/O bottlenecks or query optimization needs.
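    The two use cases mentioned above, spotting bad row estimates and comparing cache hits with disk reads, can also be explored by hand on a plan produced with `EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)`. The sketch below is an independent illustration and not Datadog's implementation: the helper names, the 10x threshold, and the `SAMPLE_PLAN` values are assumptions made up for the example, while the dictionary keys follow PostgreSQL's JSON plan format.

```python
def find_misestimates(node, ratio_threshold=10.0):
    """Return (node_type, estimated_rows, actual_rows, ratio) for each plan
    node whose row estimate is off by at least `ratio_threshold` times."""
    findings = []
    estimated = node.get("Plan Rows", 0)
    # Note: PostgreSQL reports "Actual Rows" as a per-loop average.
    actual = node.get("Actual Rows", 0)
    # Clamp to 1 so that zero-row nodes do not cause a division by zero.
    ratio = max(estimated, actual, 1) / max(min(estimated, actual), 1)
    if ratio >= ratio_threshold:
        findings.append((node["Node Type"], estimated, actual, ratio))
    for child in node.get("Plans", []):
        findings.extend(find_misestimates(child, ratio_threshold))
    return findings


def cache_hit_ratio(node):
    """Fraction of shared-buffer block accesses served from cache for a node.

    Buffer counts on a node include its children, so calling this on the
    root node describes the whole query. Returns None without buffer data.
    """
    hit = node.get("Shared Hit Blocks", 0)
    read = node.get("Shared Read Blocks", 0)
    total = hit + read
    return hit / total if total else None


# Hand-written sample plan (made-up numbers, for illustration only).
SAMPLE_PLAN = {
    "Node Type": "Nested Loop",
    "Plan Rows": 100,
    "Actual Rows": 120,
    "Shared Hit Blocks": 900,
    "Shared Read Blocks": 100,
    "Plans": [
        {"Node Type": "Seq Scan", "Plan Rows": 10, "Actual Rows": 5000},
        {"Node Type": "Index Scan", "Plan Rows": 1, "Actual Rows": 1},
    ],
}

if __name__ == "__main__":
    for finding in find_misestimates(SAMPLE_PLAN):
        print(finding)
    print(cache_hit_ratio(SAMPLE_PLAN))
```

    A large estimate-versus-actual ratio on a scan node is a hint to check table statistics, while a low hit ratio points toward I/O rather than plan shape.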

  4.
    Article
    Modal · 19w

    Keeping 20,000 GPUs healthy

    Modal manages over 20,000 GPUs across AWS, GCP, Azure, and OCI, and has encountered significant reliability and performance differences between cloud providers. Its GPU health system covers instance type benchmarking and selection, machine image preparation with automated testing, boot-time validation, continuous passive monitoring (via DCGM and dmesg), and weekly active health checks (DCGM diag, GPUBurn, NCCL tests). Key findings: cloud providers vary dramatically in H100 performance (differences of up to 50%), temperature management (some GPUs reaching 94°C), and ECC error rates. The article also cites Meta's LLaMA 3 development, where GPUs accounted for 58.7% of training failures compared to just 0.5% for CPUs, to highlight the reliability gap.

  5.
    Article
    PostgreSQL · 18w

    pgmetrics 1.19 released

    pgmetrics version 1.19 has been released. This open-source tool collects 350+ metrics from PostgreSQL servers and either displays them as text or exports them as JSON or CSV. It supports managed PostgreSQL services (AWS, Azure, GCP) and works with Citus, PgBouncer, and Pgpool. The tool has zero dependencies and ships as a single binary.

  6.
    Article
    InfluxData · 19w

    InfluxData

    Telegraf 1.37 introduces new plugins for querying Grafana Loki (LogQL) and Prometheus (PromQL), monitoring NFTables, and collecting system time metrics. New outputs include ARC-DB and heartbeat endpoints, and new secret stores cover Google Cloud and HashiCorp Vault. The release also gives notice that strict environment variable handling becomes the default in v1.38.0, so users should verify their configurations before upgrading. Additional improvements include IP filtering for socket listeners, removal of deprecated options, and persistent self-signed certificates for OPCUA plugins.