Best of Observability: April 2026

  1.
    Article
    Grafana Labs · 6w

    Kubernetes Monitoring Helm chart v4: Biggest update ever!

    Grafana's Kubernetes Monitoring Helm chart v4 is a major overhaul addressing real pain points from v3. Key changes include: converting destinations and collectors from lists to maps (enabling proper multi-file merging and named overrides), replacing hard-coded collector names with user-defined collectors using composable presets, making telemetry service deployments explicit to avoid surprise duplicates, splitting the overloaded clusterMetrics feature into three focused features, separating pod log collection methods into distinct features with native OTLP support, replacing the bulk labelsToKeep approach with explicit opt-in label declarations (reducing memory usage), and allowing granular control over individual profiler types. A migration tool is available to convert v3 values files to v4 format automatically.
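
The list-to-map change matters because Helm deep-merges maps across values files but replaces lists wholesale. The following is a minimal sketch of that merge rule, not Helm's own code; the `destinations` keys and URLs are hypothetical and do not reflect the chart's actual schema.

```python
from typing import Any, Dict


def merge_values(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Merge two values trees: maps merge key by key, anything else is replaced."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            # Lists (and scalars) from the override replace the base value whole.
            merged[key] = value
    return merged


# Hypothetical list-style values: overriding one entry drops the others.
list_base = {"destinations": [
    {"name": "metrics", "url": "http://prom.example/write"},
    {"name": "logs", "url": "http://loki.example/push"},
]}
list_override = {"destinations": [
    {"name": "metrics", "url": "http://prom-staging.example/write"},
]}

# Hypothetical map-style values: the override touches only the named entry.
map_base = {"destinations": {
    "metrics": {"url": "http://prom.example/write"},
    "logs": {"url": "http://loki.example/push"},
}}
map_override = {"destinations": {
    "metrics": {"url": "http://prom-staging.example/write"},
}}

list_result = merge_values(list_base, list_override)
map_result = merge_values(map_base, map_override)

print(len(list_result["destinations"]))    # the logs destination is gone
print(sorted(map_result["destinations"]))  # both destinations survive
```

With the list form, the override file has to repeat every destination to keep it; with the map form, it can name just the one it changes.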

  2.
    Article
    Grafana Labs · 7w

    Observability in Go: Where to start and what matters most

    Engineers from Grafana Labs and Isovalent discuss practical observability strategies for Go systems. The conversation covers starting with logs and deriving metrics from them (e.g., counting panics), when to reach for distributed tracing and how context propagation works, Go's error handling tradeoffs for observability, effective use of pprof for CPU and memory profiling (including common pitfalls like profiling when the bottleneck is actually I/O wait), and how eBPF enables visibility into kernel-level behavior beyond what user-space instrumentation can provide.
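
Deriving a metric from logs can be as simple as counting lines that match a pattern. The sketch below counts panics per service; it is written in Python only for brevity, and the `<service> <message>` line format and the sample lines are assumptions, not anything taken from the conversation.

```python
from collections import Counter
from typing import Iterable


def count_panics(log_lines: Iterable[str]) -> Counter:
    """Count log lines whose message starts with 'panic:', grouped by service."""
    counts: Counter = Counter()
    for line in log_lines:
        # Assumed format: first token is the service name, the rest is the message.
        service, _, message = line.partition(" ")
        if message.startswith("panic:"):
            counts[service] += 1
    return counts


# Hypothetical sample input.
SAMPLE_LOGS = [
    "checkout panic: runtime error: index out of range [3] with length 3",
    'checkout level=info msg="order placed"',
    "cart panic: assignment to entry in nil map",
    "checkout panic: runtime error: invalid memory address or nil pointer dereference",
]

panic_counts = count_panics(SAMPLE_LOGS)
print(dict(panic_counts))
```

In practice a log pipeline would usually do this counting at query time, but the idea is the same: the log line is the source of truth and the metric is an aggregation over it.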

  3.
    Article
    freeCodeCamp · 4w

    From Metrics to Meaning: How PaaS Helps Developers Understand Production

    Modern production systems generate overwhelming amounts of data, but more metrics don't automatically mean better understanding. The real problem is interpretation, not observability. PaaS platforms like Sevalla, Railway, and Render help by abstracting away infrastructure concerns so that five key metrics — latency, error rate, throughput, resource utilization, and instance health — map more directly to application behavior. Instead of chasing cross-layer infrastructure issues, developers can focus on code, queries, and dependencies. The result is fewer variables to reason about and clearer signals that reduce the gap between symptom and cause.

  4.
    Article
    Grafana Labs · 5w

    Introducing Pyroscope 2.0: faster, more cost-effective continuous profiling at scale

    Pyroscope 2.0 is a ground-up rearchitecture of the open source continuous profiling database. Key changes include eliminating write-path replication (each profile written once to object storage instead of 3x), data co-location that reduces symbol storage by up to 95%, and a fully stateless read path enabling elastic query scaling. Rollouts that previously took 8 to 12 hours now complete in minutes. The new architecture also enables metrics derived from profiles, individual profile inspection, and heatmap queries. Pyroscope 2.0 has been running in Grafana Cloud since April 2025, processing 19.5 PB of data. Native OTLP profiling support is included. Object storage is now required for distributed deployments.

  5.
    Article
    portkey · 5w

    What is AIOps?

    AIOps for LLM systems addresses the gap between traditional infrastructure monitoring and the operational needs of production AI. Standard monitoring confirms systems are running but misses output drift, cost spikes, and request-level failures. AIOps introduces a control layer between applications and model providers that enables end-to-end request tracing, runtime routing and policy enforcement, proactive cost controls, and governance with full auditability. Practical implementation involves a gateway that intercepts every request, applies routing rules, enforces usage limits, and logs full execution context. Teams benefit from faster debugging, predictable costs, and consistent model behavior.
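
The gateway pattern described here (intercept, route, enforce limits, log) can be sketched in a few lines. Everything named below is hypothetical: the `Gateway` class, the routing table, the word-count token estimate, and the `call_model` callback stand in for whatever a real gateway product provides.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Gateway:
    routes: Dict[str, str]        # task type -> model name; must include "default"
    token_budget: Dict[str, int]  # team -> remaining tokens
    audit_log: List[dict] = field(default_factory=list)

    def handle(self, team: str, task: str, prompt: str,
               call_model: Callable[[str, str], str]) -> Optional[str]:
        # Routing rule: pick a model by task type, falling back to the default.
        model = self.routes.get(task, self.routes["default"])
        # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
        estimated_tokens = len(prompt.split())
        # Usage limit: refuse the request if the team's budget cannot cover it.
        allowed = self.token_budget.get(team, 0) >= estimated_tokens
        response = None
        if allowed:
            self.token_budget[team] -= estimated_tokens
            response = call_model(model, prompt)
        # Log the execution context for every request, allowed or not.
        self.audit_log.append({
            "team": team, "task": task, "model": model,
            "tokens": estimated_tokens, "allowed": allowed,
        })
        return response


def fake_model(model: str, prompt: str) -> str:
    """Placeholder provider call so the sketch runs without a network."""
    return f"{model}:{len(prompt)}"


gateway = Gateway(
    routes={"default": "small-model", "code": "large-model"},
    token_budget={"search": 5},
)
first = gateway.handle("search", "code", "fix this bug", fake_model)
second = gateway.handle("search", "chat", "one two three", fake_model)
print(first, second, len(gateway.audit_log))
```

The second request is rejected because the first one used three of the team's five tokens, and both requests end up in the audit log.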

  6.
    Article
    Grafana Labs · 6w

    Grafana Alerting: Respond faster and get situational awareness with alert enrichment in Grafana Cloud

    Grafana Cloud has introduced alert enrichment, a new public preview feature for Grafana Alerting that attaches contextual information to alerts before they reach on-call responders. Instead of bare signals like 'CPU usage is high,' enriched alerts can include relevant log lines, AI-powered explanations, links to dashboards, automated ML investigations via Sift or Assistant, dynamic templated annotations, and data fetched from external APIs or data sources like Loki and Mimir. Seven enricher types are available: Assign, External, Datasource, Sift, Knowledge Graph, Explain, and Assistant. Enrichments can be scoped per alert rule or applied globally across all alerts by label/annotation filters. The goal is to automate the first triage steps so engineers can focus on resolution rather than context-gathering.

  7.
    Article
    OpenTelemetry · 5w

    How Skyscanner scales OpenTelemetry: managing collectors across 24 production clusters

    Skyscanner's platform engineering team shares how they manage OpenTelemetry Collector deployments across 24 production Kubernetes clusters running 1,000+ microservices. Key architectural decisions include a centralized DNS endpoint with Istio-based routing, two collector patterns (Gateway ReplicaSet and Agent DaemonSet), and generating platform-level HTTP/gRPC metrics from Istio service mesh spans using the span metrics connector — eliminating the need for application-level instrumentation. The Java-heavy environment uses a shared base Docker image with the OTel Java agent pre-configured, with all instrumentations disabled by default and only a curated set enabled. SDK-generated HTTP/RPC metrics are dropped in favor of lower-cardinality Istio-derived metrics. Rollouts follow a progressive promotion strategy across dev, alpha, beta, and production cluster tiers using Argo CD. Practical advice includes starting simple, adding memory limiters from day one, and using filter processors early to handle false-positive error statuses.
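
The idea of deriving request metrics from spans, rather than from application instrumentation, can be illustrated with a small aggregation. This is a conceptual sketch only, not the span metrics connector's implementation; the span fields, bucket bounds, and sample data are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical latency bucket upper bounds in milliseconds; a final overflow bucket is implied.
BUCKETS_MS = (50, 250, 1000)

Key = Tuple[str, int]  # (service name, HTTP status code)


def spans_to_metrics(spans: Iterable[dict]) -> Tuple[Dict[Key, int], Dict[Key, List[int]]]:
    """Aggregate spans into per-(service, status) call counts and latency buckets."""
    calls: Dict[Key, int] = defaultdict(int)
    buckets: Dict[Key, List[int]] = defaultdict(lambda: [0] * (len(BUCKETS_MS) + 1))
    for span in spans:
        key = (span["service"], span["status_code"])
        calls[key] += 1
        # The bucket index is the number of bounds the duration exceeds.
        index = sum(span["duration_ms"] > bound for bound in BUCKETS_MS)
        buckets[key][index] += 1
    return dict(calls), dict(buckets)


# Hypothetical mesh spans.
SAMPLE_SPANS = [
    {"service": "search", "status_code": 200, "duration_ms": 30},
    {"service": "search", "status_code": 200, "duration_ms": 120},
    {"service": "search", "status_code": 500, "duration_ms": 2000},
]

call_counts, latency_buckets = spans_to_metrics(SAMPLE_SPANS)
print(call_counts)
print(latency_buckets)
```

Keying only on service and status code keeps the label set small, which is the same cardinality trade-off the summary attributes to preferring mesh-derived metrics over SDK-generated ones.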

  8.
    Article
    Grafana Labs · 5w

    Introducing o11y-bench: an open benchmark for AI agents running observability workflows

    Grafana Labs has open sourced o11y-bench, a benchmark for evaluating AI agents on observability workflows. Built on the Harbor framework, it runs agents against a real Grafana stack with synthetic metrics, logs, and traces, then grades them on 63 tasks spanning PromQL queries, LogQL, TraceQL, multi-step incident investigations, and dashboard editing. The benchmark uses two headline metrics — Pass^3 (consistency across three runs) and Pass@3 (best-of-three success) — prioritizing reliability over one-off success. Initial results across 29 model variants showed Claude Opus 4.7 (reasoning off) leading on consistency, with Qwen 3.6 Plus as the top open-source model. Dashboarding tasks proved hardest due to combined state, query correctness, and variable wiring requirements. The project is open source and accepts community contributions to its HuggingFace leaderboard.
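
The two headline metrics differ only in how three runs of a task are combined. The sketch below reads Pass^3 as "all three runs passed" and Pass@3 as "at least one of three runs passed", following the glosses in the summary; the benchmark's exact scoring may differ, and the task names and results are made up for illustration.

```python
from typing import Dict, List


def pass_all_runs(results: Dict[str, List[bool]]) -> float:
    """Pass^k reading: fraction of tasks where every run passed (consistency)."""
    return sum(all(runs) for runs in results.values()) / len(results)


def pass_any_run(results: Dict[str, List[bool]]) -> float:
    """Pass@k reading: fraction of tasks where at least one run passed (best-of-k)."""
    return sum(any(runs) for runs in results.values()) / len(results)


# Hypothetical results: task name -> outcome of each of three runs.
RESULTS = {
    "promql-rate-query": [True, True, True],
    "logql-error-filter": [True, False, True],
    "traceql-slow-spans": [False, False, False],
    "dashboard-add-panel": [False, True, False],
}

consistency = pass_all_runs(RESULTS)
best_of_three = pass_any_run(RESULTS)
print(consistency, best_of_three)
```

An agent that succeeds intermittently scores well on the second metric and poorly on the first, which is why a benchmark that prioritizes reliability leads with the stricter one.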