Best of Observability: March 2026

  1. Article
    Tokio · 4w

    Introducing dial9: a flight recorder for Tokio

    dial9 is a new runtime telemetry tool for Tokio that captures a full timeline of runtime events (individual polls, parks, wakes, and Linux kernel events) rather than just aggregate metrics. Built to diagnose production-only performance issues, it helped identify kernel scheduling delays of 10ms+ on an AWS service, fd_table lock contention causing 100ms+ polls during startup, and a global mutex in backtrace::trace. With under 5% overhead, it can run continuously in production. Setup requires wrapping the Tokio runtime with TracedRuntime; traces can be viewed in a browser-based viewer or stored to S3.

  2. Article
    Supabase · 6w

    Log Drains: Now available on Pro

    Supabase has launched Log Drains for Pro tier users, allowing them to forward logs from all Supabase infrastructure layers (Postgres, Auth, Storage, Edge Functions, Realtime, API Gateway) to external logging backends. Supported destinations include Sentry, Grafana Loki, Datadog, AWS S3, Axiom, and any generic HTTP endpoint. Logs are batched (up to 250 per batch or every second) with optional Gzip compression. Pricing starts at $10 per drain per project plus $0.20 per million events and $0.09 per GB egress. The feature enables centralized observability across the full stack without context-switching between the Supabase console and existing monitoring tools.
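
    The batching rule above (flush at 250 events or after one second, whichever comes first) can be illustrated with a minimal sketch. This is not Supabase's implementation: the `LogBatcher` class, its `send` and `clock` parameters, and the JSON payload shape are all hypothetical, chosen only to make the size-or-age trigger concrete.

```python
import gzip
import json
from typing import Callable, List, Optional

MAX_BATCH = 250          # from the summary: up to 250 events per batch
MAX_WAIT_SECONDS = 1.0   # from the summary: or every second


class LogBatcher:
    """Hypothetical batcher: flushes on batch size or batch age, whichever comes first."""

    def __init__(self, send: Callable[[bytes], None],
                 clock: Callable[[], float], compress: bool = False) -> None:
        self._send = send          # assumed transport callback (e.g. an HTTP POST)
        self._clock = clock        # injected so the timing logic is testable
        self._compress = compress  # optional Gzip, as the summary mentions
        self._buffer: List[dict] = []
        self._first_at: Optional[float] = None

    def add(self, event: dict) -> None:
        if not self._buffer:
            self._first_at = self._clock()  # age is measured from the oldest event
        self._buffer.append(event)
        if len(self._buffer) >= MAX_BATCH:
            self.flush()

    def tick(self) -> None:
        """Call periodically; flushes once the oldest buffered event is old enough."""
        if self._buffer and self._clock() - self._first_at >= MAX_WAIT_SECONDS:
            self.flush()

    def flush(self) -> None:
        if not self._buffer:
            return
        payload = json.dumps(self._buffer).encode("utf-8")
        if self._compress:
            payload = gzip.compress(payload)
        self._send(payload)
        self._buffer = []
        self._first_at = None
```

    Injecting the clock keeps the sketch deterministic; a real drain would drive `tick` from a timer.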

  3. Article
    Grafana Labs · 6w

    OpenTelemetry support for .NET 10: A behind-the-scenes look

    Grafana Labs engineers worked with the OpenTelemetry .NET community to deliver native .NET 10 support for the OpenTelemetry instrumentation libraries in the same week as .NET 10's stable release in November 2025. The post covers the value of early preview validation (catching a logging source generator regression before RC2), key changes in the new release including schema URL support for metrics and traces, new ASP.NET Core metrics for authentication, Blazor, and memory pools, and a breaking change where the default trace context propagator switched to W3C. It also flags a binding redirect gotcha for .NET Framework apps using .NET 10 assemblies.
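
    For readers unfamiliar with what a W3C propagator puts on the wire, the sketch below parses a version `00` `traceparent` header as defined by the W3C Trace Context specification. It is written in Python purely for illustration and is not taken from the post or from the .NET libraries; the `TraceParent` type and `parse_traceparent` name are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Version 00 layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed or invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent ids.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))
```

    This handles only the four-field version `00` layout; the specification allows later versions to append fields, which this sketch would reject.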

  4. Article
    Tinybird · 3w

    Maple: an open-source observability platform built with Tinybird's TypeScript SDK

    David Granzin, a Senior Systems Engineer at Superwall, built Maple — an open-source observability platform for distributed traces, logs, and metrics — using Tinybird's TypeScript SDK on top of managed ClickHouse. With no infrastructure to provision, Tinybird's local-first development, schema visualization, branch environments, and TypeScript SDK cut his estimated 12-week build down to 7 weeks. The TypeScript SDK also enabled AI coding agents to read and modify Tinybird resources directly, accelerating iteration. In the same timeframe, David also built audit log infrastructure for a second project, Hazel, an AI-first Slack alternative.

  5. Article
    monday Engineering · 4w

    Less Noise, Better Sleep: Data-Driven Approach to Healthier Alerts

    monday.com's engineering team tackled alert fatigue and noisy on-call schedules by treating production alerts as data. They built a PagerDuty integration to log every alert into a monday.com board, enriched the raw data with context (true/false positive, root cause, duplicates), reviewed trends quarterly in engineering reviews, and converted findings into prioritized backlog items. Over a year, this framework cut false-positive alerts by half and improved system resiliency across a 20-engineer group.

  6. Article
    OpenTelemetry · 4w

    How Mastodon Runs OpenTelemetry Collectors in Production

    Mastodon, a non-profit decentralized social platform with ~20 staff, shares how a single engineer runs OpenTelemetry Collectors in production across two large Kubernetes deployments handling up to 10 million requests per minute. The setup uses one Collector per Kubernetes namespace, managed via the OpenTelemetry Operator and Argo CD, with no complex gateway tiers. Traffic is controlled through tail-based sampling (0.1% for successful traces, 100% for errors). The full production config is shared, including OTLP ingestion, Kubernetes metadata enrichment, resource detection, and Datadog export. Key lessons: keep architecture simple, use Kubernetes operators for lifecycle management, rely on semantic conventions, and upgrade frequently.
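
    The sampling policy described above (keep every trace that contains an error, keep 0.1% of the rest) can be sketched as a decision made once a trace is complete. This is an illustration only, not Mastodon's Collector configuration or the Collector's tail sampling processor; the span dictionary shape, the `status` key, and the hash-based bucketing are assumptions.

```python
import hashlib
from typing import Mapping, Sequence

SUCCESS_KEEP_RATIO = 0.001  # 0.1% of successful traces, per the summary


def keep_trace(trace_id: str, spans: Sequence[Mapping[str, str]]) -> bool:
    """Tail-based decision: inspect the whole trace, then keep or drop it."""
    # Errors are always kept (the 100% rule).
    if any(span.get("status") == "error" for span in spans):
        return True
    # Hash the trace id into [0, 1) so the decision is deterministic:
    # every worker that sees the same trace id reaches the same verdict.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < SUCCESS_KEEP_RATIO
```

    The key property of tail-based sampling is visible in the signature: the function needs all spans of the trace, so the decision cannot be made when the first span arrives.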

  7. Article
    Grafana Labs · 6w

    Build, buy, or open source? Understanding your options with Grafana’s AI-powered observability

    A framework for choosing between building, buying, or using open source for AI-powered observability. The three options are framed as lanes: building in-house offers maximum control but requires significant ongoing investment; open source provides a flexible foundation but leaves teams responsible for AI orchestration, prompt design, and evaluation; and managed solutions like Grafana Cloud offer out-of-the-box AI capabilities including Grafana Assistant and Assistant Investigations. The post argues these aren't mutually exclusive—Grafana's ecosystem allows teams to combine approaches, shift strategies over time, and extend managed AI with custom agents and internal knowledge sources via open APIs.

  8. Article
    KubeSquad · 3w

    Instrumenting Rust TLS with eBPF

    Instrumenting rustls with eBPF for L7 traffic capture is non-trivial because rustls separates TLS operations from socket I/O, unlike OpenSSL. On the write path, the OpenSSL-style correlation still works. On the read path, the order is inverted: the syscall fires before plaintext is available, requiring reverse correlation by stashing the file descriptor on recvfrom and attaching it later when reader.read fires. An additional gotcha is that Rust's Result<usize> return convention stores success/error in rax and byte count in rdx, not rax alone. Symbol detection is also tricky due to unstable Rust name mangling, requiring pattern-based scanning rather than exact name matching. These techniques are implemented in Coroot, an open source eBPF-based observability tool, enabling automatic TLS decryption for hyper, axum, sqlx, tokio-rustls, and other rustls-backed libraries with no code changes.
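
    The read-path reverse correlation can be modelled in user space to show the bookkeeping involved. The sketch below is a Python simulation, not Coroot's eBPF code: the `ReadCorrelator` class, keying the stash by thread id, and treating `rax == 0` as the success case are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PlaintextEvent:
    fd: int
    data: bytes


class ReadCorrelator:
    """Simulates stashing the fd at recvfrom and attaching it at reader.read."""

    def __init__(self) -> None:
        # Stand-in for a BPF hash map; keyed by thread id (an assumption).
        self._last_fd_by_thread: Dict[int, int] = {}
        self.events: List[PlaintextEvent] = []

    def on_recvfrom(self, thread_id: int, fd: int) -> None:
        # The syscall fires first, before any plaintext exists: remember the fd.
        self._last_fd_by_thread[thread_id] = fd

    def on_reader_read_return(self, thread_id: int, rax: int, rdx: int,
                              buffer: bytes) -> Optional[PlaintextEvent]:
        # Per the summary, rax carries success/error and rdx the byte count.
        fd = self._last_fd_by_thread.get(thread_id)
        if fd is None or rax != 0:  # assumed: 0 means success
            return None
        event = PlaintextEvent(fd=fd, data=buffer[:rdx])
        self.events.append(event)
        return event
```

    The stash is left in place after a read so that several `reader.read` calls following one `recvfrom` still resolve to the same descriptor; whether a real probe should clear it is a design choice this sketch does not settle.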

  9. Article
    OpenTelemetry · 6w

    OTTL context inference comes to the Filter Processor

    OpenTelemetry collector-contrib v0.146.0 introduces OTTL context inference to the Filter Processor, previously only available in the Transform Processor. Four new top-level config fields — trace_conditions, metric_conditions, log_conditions, and profile_conditions — let users write flat condition lists without manually organizing them into context blocks like resource, span, or spanevent. The processor automatically infers the execution context from path prefixes, combines conditions with logical OR, and executes them hierarchically (higher-level matches short-circuit lower-level evaluation for performance). Advanced grouping with per-group error_mode settings is also supported. The legacy configuration format remains fully supported.
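
    To make "infers the execution context from path prefixes" concrete, here is a simplified sketch for trace conditions. It is not the Collector's algorithm: the regex-based scan, the `infer_context` and `group_by_context` names, and the rule of picking the lowest-level context a condition mentions are assumptions for illustration.

```python
import re
from typing import Dict, List

# Trace contexts from highest to lowest level.
TRACE_CONTEXTS = ["resource", "scope", "span", "spanevent"]

_PATH = re.compile(r"\b(" + "|".join(TRACE_CONTEXTS) + r")\.")
_STRING = re.compile(r'"(?:[^"\\]|\\.)*"')


def infer_context(condition: str) -> str:
    """Pick the lowest-level context whose prefix appears in the condition."""
    # Blank out string literals so text like "span.end" is not read as a path.
    stripped = _STRING.sub('""', condition)
    found = set(_PATH.findall(stripped))
    if not found:
        raise ValueError("no context prefix in condition: %r" % condition)
    return max(found, key=TRACE_CONTEXTS.index)


def group_by_context(conditions: List[str]) -> Dict[str, List[str]]:
    """Group a flat condition list by context, highest level first."""
    groups: Dict[str, List[str]] = {}
    for context in TRACE_CONTEXTS:
        matching = [c for c in conditions if infer_context(c) == context]
        if matching:
            groups[context] = matching
    return groups
```

    Ordering the groups from `resource` down to `spanevent` mirrors the hierarchical evaluation described above, where a match at a higher level makes lower-level checks unnecessary.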