Best of Observability, October 2025

  1. Article
    ByteByteGo · 24w

    How Nubank Built an In-house Logging Platform for 1 Trillion Log Entries

    Nubank built an in-house logging platform to replace a costly third-party vendor, handling 1 trillion daily log entries at 50% lower cost. The solution uses a two-phase architecture: an ingestion pipeline with Fluent Bit, custom buffering, and processing services, plus a query/storage layer combining Trino, AWS S3, and Parquet format. The platform processes 1 petabyte daily, maintains 45 petabytes of searchable data with 45-day retention, and serves 15,000 queries daily scanning 150 petabytes. Key design decisions included decoupling ingestion from querying, implementing micro-batching for reliability, and achieving 95% data compression with Parquet.
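The micro-batching decision called out above can be sketched in a few lines. This is a minimal illustration of the pattern, not Nubank's actual implementation; the class name, thresholds, and flush callback are all assumptions.

```python
import time

class MicroBatcher:
    """Buffers log entries and flushes them in small batches.

    Flushing on either a size or an age threshold bounds both memory
    use and the delay before an entry becomes durable, which is the
    reliability trade-off micro-batching makes.
    """

    def __init__(self, flush_fn, max_entries=1000, max_age_s=2.0):
        self.flush_fn = flush_fn        # e.g. write a Parquet file to S3 (hypothetical)
        self.max_entries = max_entries  # illustrative threshold
        self.max_age_s = max_age_s      # illustrative threshold
        self.buffer = []
        self.oldest = None

    def add(self, entry):
        if self.oldest is None:
            self.oldest = time.monotonic()
        self.buffer.append(entry)
        if (len(self.buffer) >= self.max_entries
                or time.monotonic() - self.oldest >= self.max_age_s):
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []
            self.oldest = None

batches = []
batcher = MicroBatcher(batches.append, max_entries=3)
for i in range(7):
    batcher.add({"msg": f"log {i}"})
batcher.flush()  # drain the remainder
print([len(b) for b in batches])  # → [3, 3, 1]
```

Decoupling the flush callback from the buffering logic mirrors the article's larger point: ingestion and storage/query concerns stay separate.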

  2. Article
    Buildkite · 24w

    Kubernetes with Buildkite: faster, simpler, and ready for scale

    Buildkite has updated its Kubernetes Agent Stack: installation is simplified to require only a single agent token instead of multiple configuration parameters; scaling now handles tens of thousands of concurrent jobs with Kubernetes objects that are 80% smaller; errors surface more clearly, with full YAML specs and stack-level failure signals; Prometheus integration works out of the box for instant observability dashboards; and Helm configuration options have been expanded. Future improvements include custom scheduling policies, more granular job states, and fine-grained job configuration controls.

  3. Article
    Deno · 25w

    My highlights from the new Deno Deploy

    Deno Deploy has been rebuilt from scratch with major improvements including integrated CI/CD, simplified database management with KV and Postgres support, built-in metrics and OpenTelemetry observability, enhanced CLI tooling, local-to-production tunneling, cloud service integrations for AWS and GCP, and improved playgrounds. The platform now offers automatic framework detection, environment-specific database provisioning, and centralized configuration management while maintaining both static and dynamic hosting capabilities.

  4. Article
    Charity · 26w

    Got opinions on observability? I could use your help (once more, with feeling)

    Charity Majors is seeking community input for the second edition of her observability book, specifically requesting experiences and opinions on vendor migrations, cost management strategies for traditional three-pillar architectures, observability team structures, OpenTelemetry adoption decisions, and build-vs-buy considerations. She emphasizes that vendor engineering and software procurement are high-leverage activities requiring deep technical expertise, and shares specific questions about managing observability tools at scale, including migration playbooks, cost control tactics, and instrumentation automation.

  5. Article
    Charity · 24w

    The Pillar Is A Lie

    A critical examination of the "three pillars of observability" (metrics, logs, traces) framework, arguing it's primarily a marketing construct that leads to expensive, siloed tooling. The piece advocates for unified storage models (observability 2.0) that treat all telemetry signals as interconnected structured data, eliminating the need to hop between separate systems. It clarifies that OpenTelemetry uses "signals" as the technical term, not "pillars," and explains how modern observability tools can store data once while deriving multiple signal types from the same source, reducing costs and cognitive load.
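The "store once, derive many signals" model the piece advocates can be shown with a toy example. The event fields and the three derived views below are illustrative assumptions, not any specific vendor's schema.

```python
from collections import Counter

# One store of wide structured events (the "observability 2.0" idea):
# each event carries trace context, log-like fields, and timings.
events = [
    {"trace_id": "t1", "span": "checkout", "service": "api", "dur_ms": 120, "status": "ok"},
    {"trace_id": "t1", "span": "charge",   "service": "pay", "dur_ms": 95,  "status": "ok"},
    {"trace_id": "t2", "span": "checkout", "service": "api", "dur_ms": 310, "status": "error"},
]

# "Metrics" view: aggregates derived on read from the same events.
status_counts = Counter(e["status"] for e in events)

# "Logs" view: each event rendered as a log line.
log_lines = [f'{e["service"]} {e["span"]} status={e["status"]} dur={e["dur_ms"]}ms'
             for e in events]

# "Traces" view: the same events grouped by trace_id.
traces = {}
for e in events:
    traces.setdefault(e["trace_id"], []).append(e["span"])

print(status_counts["error"])  # → 1
print(traces["t1"])            # → ['checkout', 'charge']
```

Because all three views are projections of one dataset, there is no separate system to hop to when pivoting from a metric spike to the traces behind it.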

  6. Article
    Grafana Labs · 25w

    Grafana Tempo 2.9 release: MCP server support, TraceQL metrics sampling, and more

    Grafana Tempo 2.9 introduces MCP server support enabling LLMs like Claude to query and analyze distributed tracing data through TraceQL. The release adds query sampling hints to speed up metrics queries by returning approximate results, particularly useful for high-volume environments. New operational improvements include metrics for monitoring query I/O and span timestamp distances, plus enhanced cost-attribution tracking with resource and span-level scoping for multi-tenant deployments. The team is also working on vParquet5 block format and Project Rhythm architecture to improve scalability and reduce operational costs.
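The trade-off behind sampled metrics queries can be sketched generically (this is not Tempo's implementation): inspect only a p-fraction of spans, then scale the result by 1/p to get an approximate answer at a fraction of the I/O.

```python
import random

def exact_count(spans):
    return len(spans)

def sampled_estimate(spans, p, seed=0):
    """Approximate a count by scanning only a p-fraction of spans
    and scaling by 1/p (a Horvitz-Thompson style estimator)."""
    rng = random.Random(seed)  # seeded for reproducibility
    sampled = [s for s in spans if rng.random() < p]
    return len(sampled) / p

spans = list(range(100_000))
est = sampled_estimate(spans, p=0.1)
# The estimate lands close to the exact count while scanning ~10% of the data.
print(abs(est - exact_count(spans)) / exact_count(spans) < 0.05)  # → True
```

For high-volume environments the variance shrinks with volume, which is why approximate results are often acceptable exactly where exact queries are slowest.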

  7. Article
    ClickHouse · 27w

    Inside Laravel Nightwatch’s Observability Pipeline: Real-Time Event Processing with Amazon MSK and ClickHouse Cloud

    Laravel Nightwatch processes over 1 billion observability events daily using Amazon MSK and ClickHouse Cloud. The platform combines MSK Express brokers for event ingestion, ClickPipes for streaming data to ClickHouse, and AWS Lambda for validation. This architecture achieves sub-second query latency while handling millions of events per second. On launch day, the system processed 500 million events with 97ms average dashboard latency for 5,300 users. The dual-database design separates transactional workloads on RDS PostgreSQL from analytical workloads on ClickHouse Cloud, enabling horizontal scaling and cost-effective real-time monitoring at global scale.
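The Lambda validation step in a pipeline like this can be sketched generically; the required fields and the `validate` function below are illustrative assumptions, not Nightwatch's actual schema.

```python
REQUIRED = {"event_type", "timestamp", "payload"}

def validate(event):
    """Accept an event only if it is a dict with the required fields
    and a numeric timestamp; everything else is dropped before it
    reaches the analytical store."""
    return (isinstance(event, dict)
            and REQUIRED <= event.keys()
            and isinstance(event["timestamp"], (int, float)))

batch = [
    {"event_type": "query", "timestamp": 1700000000.5, "payload": {"sql": "select 1"}},
    {"event_type": "query", "payload": {}},  # missing timestamp: rejected
    "not-an-event",                          # wrong type: rejected
]
valid = [e for e in batch if validate(e)]
print(len(valid))  # → 1
```

Filtering before the stream, rather than inside the analytical database, keeps malformed events from consuming ingest and storage capacity downstream.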