Best of Observability, January 2026

  1. Article · Last9 · 13w

    Logs vs Metrics: A Practical Guide for Engineers

    Metrics and logs serve complementary purposes in production systems. Metrics provide fast, cheap aggregated data for alerting and trend analysis, showing that something is wrong. Logs offer detailed event records for debugging and auditing, revealing what specifically went wrong. The practical workflow combines both: metrics alert you to problems, dashboards confirm patterns, and logs explain root causes. Start with the four golden signals (latency, traffic, errors, saturation) as metrics, use structured JSON logging strategically at service boundaries and for errors, and connect both with request IDs for effective troubleshooting.
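    The workflow the article describes hinges on carrying a shared request ID through structured logs so a metric alert can be traced down to specific log lines. A minimal sketch of that idea in Python, using only the standard library; the field names (`ts`, `level`, `msg`, `request_id`) are illustrative conventions, not a prescribed schema:

```python
import json
import logging
import sys
import uuid


class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object per line (structured logging)."""

    def format(self, record):
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
        }
        # Attach the request ID if the caller passed one via `extra=`.
        request_id = getattr(record, "request_id", None)
        if request_id is not None:
            entry["request_id"] = request_id
        return json.dumps(entry)


logger = logging.getLogger("checkout")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Log at the service boundary; the same request ID can be attached to a
# metric (e.g. as an exemplar) so alert -> dashboard -> log lookup works.
req_id = str(uuid.uuid4())
logger.info("payment authorized", extra={"request_id": req_id})
```

    Logging one JSON object per line keeps the output cheap to ship and trivially queryable by request ID, which is the connective tissue between the metrics and logs sides of the workflow.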

  2. Article · OpenTelemetry · 13w

    OpenTelemetry JS Statement on Node.js DOS Mitigation

    OpenTelemetry clarifies that a recent Node.js denial-of-service advisory involving async_hooks is not a vulnerability in OpenTelemetry itself. The issue stems from applications relying on unspecified stack space exhaustion behavior. Node.js has fixed this in version 20.20.0 and newer, but the fix won't be backported to Node.js 18. Users should upgrade to Node.js 20+ as the recommended mitigation, with no OpenTelemetry-specific changes required.

  3. Article · ClickHouse · 15w

    Your AI SRE needs better observability, not bigger models.

    AI SRE tools fail not because of weak models, but because they lack proper observability foundations. Legacy systems with short retention windows, dropped high-cardinality data, and slow queries prevent AI from performing effective root cause analysis. ClickHouse's columnar architecture enables long-retention, high-cardinality observability at scale with sub-second query speeds, making it ideal for AI SRE copilots. The article presents a reference architecture combining ClickHouse with context layers (deployments, topology, incident history) and LLMs via SQL to create an investigative copilot that correlates data and surfaces insights while keeping humans in control of remediation decisions.
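    To make the "LLM via SQL" idea concrete, here is a sketch of the kind of query an investigative copilot might generate: counting recent errors per minute, grouped by a high-cardinality deployment attribute, so an error burst can be lined up against a rollout. The table name `otel_logs` and its columns (`Timestamp`, `ServiceName`, `SeverityText`, a `LogAttributes` map) are an assumed OpenTelemetry-style schema, not something the article specifies:

```python
def error_spike_query(service: str, window_minutes: int = 60) -> str:
    """Build a ClickHouse-flavored SQL query over an assumed otel_logs table.

    Groups recent ERROR logs by minute and by deployment version so a
    copilot (or a human) can see whether an error spike follows a deploy.
    Real code would bind `service` as a query parameter, not interpolate it.
    """
    return f"""
    SELECT
        toStartOfMinute(Timestamp) AS minute,
        LogAttributes['deployment.version'] AS version,
        count() AS errors
    FROM otel_logs
    WHERE ServiceName = '{service}'
      AND SeverityText = 'ERROR'
      AND Timestamp > now() - INTERVAL {window_minutes} MINUTE
    GROUP BY minute, version
    ORDER BY minute
    """
```

    The grouping key here is exactly the kind of high-cardinality dimension (per-deploy version, or per-customer, per-endpoint) that the article argues legacy backends drop, and that a columnar store can retain and scan quickly.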

  4. Article · LangChain · 13w

    In software, the code documents the app. In AI, the traces do.

    AI agents shift the source of truth from code to runtime traces because decision logic happens inside the model, not in your codebase. Unlike traditional software where code documents behavior deterministically, agent code is just scaffolding that orchestrates LLM calls. This fundamental change means debugging becomes trace analysis, testing requires continuous evals in production, performance optimization focuses on decision patterns rather than algorithms, and collaboration moves to observability platforms. Without structured tracing to capture reasoning chains, tool calls, and decision points, you're working blind since the actual intelligence lives in traces, not code.
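    The core claim, that tool calls and decision points must be captured as structured trace events rather than inferred from code, can be sketched with a tiny decorator. This is an illustrative in-memory version, not LangChain's tracing API; the event fields and the `lookup_weather` tool are hypothetical:

```python
import functools
import json
import time

# In-memory trace sink; a real system would export spans to a tracing backend.
TRACE: list[dict] = []


def traced_tool(fn):
    """Record each tool call's name, inputs, output, and duration."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        result = fn(*args, **kwargs)
        TRACE.append({
            "tool": fn.__name__,
            "args": args,
            "kwargs": kwargs,
            "result": result,
            "duration_ms": round((time.monotonic() - start) * 1000, 3),
        })
        return result

    return wrapper


@traced_tool
def lookup_weather(city: str) -> str:
    # Stand-in for a real tool an agent might invoke.
    return f"sunny in {city}"


lookup_weather("Berlin")
print(json.dumps(TRACE, default=str, indent=2))
```

    The code above is pure scaffolding; the record appended to `TRACE` is the artifact you would actually debug, eval, and share, which is the article's point about where the source of truth now lives.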