Best of Observability — 2025

1
Video
Fireship·1y
Grafana is the goat... Let's deploy the LGTM stack
Learn how to deploy the LGTM stack to collect and visualize telemetry data from your server using OpenTelemetry. The stack includes Grafana for visualization, Prometheus for metric storage, Tempo for traces, and Loki for logs, all hosted on a Linux virtual private server. This guide simplifies the setup process using Docker, making it accessible for those new to these technologies.
157
9
2
Article
Hacker News·22w
Your Logs Are Lying To You
Traditional logging practices fail in modern distributed systems because they produce fragmented, context-poor log lines that are difficult to search and correlate. The solution is "wide events" (also called canonical log lines): emitting one comprehensive, structured event per request per service that contains all relevant context—user data, business metrics, infrastructure details, and error information. This approach transforms debugging from text searching into structured querying, enabling complex questions to be answered with simple SQL-like queries. Key implementation strategies include building events throughout the request lifecycle, using tail-based sampling to keep all errors while sampling successful requests, and deliberately instrumenting code with business context rather than relying on auto-instrumentation alone.
154
3
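The wide-event idea summarized in the item above can be sketched in a few lines of Python. This is a minimal illustration and not the article's code: the `WideEvent` class, the `should_keep` and `emit` helpers, the field names, and the 5% success sample rate are all assumptions made for the example. It shows one event being built up over a request's lifecycle, then a tail-based sampling decision made only once the outcome is known.

```python
import json
import random
import time


class WideEvent:
    """Accumulates context for one request so it can be emitted as one event."""

    def __init__(self, service, request_id):
        self.fields = {"service": service, "request_id": request_id}
        self._start = time.monotonic()

    def add(self, **context):
        # Called throughout the request lifecycle to attach more context.
        self.fields.update(context)

    def finish(self, status_code, error=None):
        # Record the outcome and duration, then return the completed event.
        self.fields["status_code"] = status_code
        self.fields["duration_ms"] = round((time.monotonic() - self._start) * 1000, 3)
        if error is not None:
            self.fields["error"] = str(error)
        return self.fields


def should_keep(event, success_sample_rate=0.05, rng=random.random):
    """Tail-based sampling: decide after the request has finished.

    Every event with an error or a 5xx status is kept; successful
    requests are kept with probability `success_sample_rate`.
    """
    if "error" in event or event.get("status_code", 0) >= 500:
        return True
    return rng() < success_sample_rate


def emit(event, sink=print):
    # One structured JSON line per request per service.
    sink(json.dumps(event, sort_keys=True))


if __name__ == "__main__":
    event = WideEvent("checkout", "req-123")
    event.add(user_id="u-42", plan="pro")             # user context
    event.add(cart_total_cents=12999, item_count=3)   # business context
    final = event.finish(status_code=500, error="payment gateway timeout")
    if should_keep(final):
        emit(final)
```

Because each request yields a single structured record, a question such as "which pro-plan users hit 5xx errors on checkout" becomes a filter on fields rather than a text search across scattered log lines.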
3
Article
Product Hunt·1y
VoltAgent - Build TS AI agents with n8n-style observability
VoltAgent is an open-source TypeScript framework designed to build and orchestrate AI agents with enhanced observability features similar to n8n. It provides developers with flexibility through code-based customization while offering a visual console for debugging and monitoring agent executions. Key features include memory management, multi-agent orchestration, and an LLM-agnostic architecture, making it a versatile tool for developers seeking deeper control and insights into AI workflows.
132
1
4
Article
Last9·48w
11 Best Log Monitoring Tools for Developers in 2025
A comprehensive comparison of 11 log monitoring tools for developers in 2025, covering solutions from simple centralized logging (Papertrail) to enterprise-scale platforms (Datadog, Dynatrace). The guide evaluates each tool's strengths, limitations, pricing, and ideal use cases, while providing practical advice on choosing the right solution based on team size, log volume, and technical requirements. Key tools covered include Last9, Better Stack, Grafana Loki, Elastic Stack, and others, with emphasis on real-world implementation considerations like structured logging, query performance, and cost optimization.
120
3
5
Article
Last9·1y
9 Best Container Monitoring Tools You Should Know in 2025
Container monitoring is essential for managing dynamic containerized environments. This post highlights the top nine container monitoring tools for 2025, including emerging solutions like Last9, open-source options like Prometheus, and enterprise-grade tools like Datadog. Key features, pricing, and developer preferences are discussed to help teams choose the best tool for their needs.
96
6
Article
freeCodeCamp·47w
Top Application Monitoring Tools for Developers
Application Performance Monitoring (APM) tools help developers detect issues before users report them. Five key tools are compared: New Relic offers comprehensive full-stack observability with real-time metrics and traces; Datadog excels in cloud-native environments with seamless integrations and powerful alerting; Prometheus + Grafana provides open-source flexibility with custom dashboards and PromQL querying; Sentry specializes in error tracking with detailed stack traces and breadcrumbs; PostHog combines product analytics with session recording and feature flags. For small teams, start with Sentry for errors and Prometheus for metrics, then consider unified solutions like Datadog or New Relic as you scale.
92
1
7
Article
Hacker News·49w
CI/CD Observability with OpenTelemetry - A Step by Step Guide
OpenTelemetry can provide comprehensive observability for CI/CD pipelines by capturing traces and metrics from GitHub Actions workflows. The setup involves configuring the OpenTelemetry Collector with a GitHub receiver that ingests webhook events as traces and scrapes repository metrics via GitHub APIs. This approach enables end-to-end visibility, performance optimization, error detection, and dependency analysis for CI/CD pipelines, replacing traditional ad-hoc monitoring methods with a unified observability framework.
79
1
8
Article
Grafana Labs·1y
Kubernetes Monitoring Helm chart 2.0: a simpler, more predictable experience
Version 2.0 of the Kubernetes Monitoring Helm chart improves ease of use and flexibility in collecting telemetry data from Kubernetes clusters. Key updates include user-focused feature design, multiple data destinations, built-in integrations for popular services, and compatibility with Fleet Management. Simplified migration from version 1.x is supported by a detailed guide and migration utility.
67
9
Article
freeCodeCamp·49w
How to Debug CI/CD Pipelines: A Handbook on Troubleshooting with Observability Tools
A comprehensive guide to implementing observability in CI/CD pipelines using free and open-source tools. Covers setting up Grafana Loki and lightweight ELK stacks for log aggregation, creating unified logging strategies with correlation IDs, writing advanced LogQL and KQL queries for troubleshooting, integrating Prometheus metrics with logs, and building Grafana dashboards. Includes practical examples for debugging common pipeline failures like build errors, dependency issues, and flaky tests across GitHub Actions, Jenkins, and GitLab.
61
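The correlation-ID strategy mentioned in the item above can be sketched with Python's standard `logging` module. This is an illustrative assumption rather than the handbook's own code: the `correlation_id` context variable, the `CorrelationFilter` and `JsonFormatter` classes, the `build_logger` helper, and the JSON field names are all hypothetical. The point is that every log line from every pipeline stage carries the same run identifier, so a log backend can filter on it.

```python
import contextvars
import io
import json
import logging

# Hypothetical: holds the ID of the current pipeline run (for example a CI run ID).
correlation_id = contextvars.ContextVar("correlation_id", default="unknown")


class CorrelationFilter(logging.Filter):
    """Stamps every record with the current correlation ID."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Renders each record as one JSON line for structured querying."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "stage": record.name,
            "correlation_id": record.correlation_id,
            "message": record.getMessage(),
        })


def build_logger(stage, stream):
    # One logger per pipeline stage, all sharing the same output format.
    logger = logging.getLogger(stage)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(CorrelationFilter())
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    out = io.StringIO()
    correlation_id.set("run-42")
    build_logger("build", out).info("compiling")
    build_logger("test", out).error("flaky test failed")
    print(out.getvalue(), end="")
```

With the ID set once at the start of a run, the build and test stages emit lines that share `correlation_id`, which is what lets a single query reconstruct the full story of one failed pipeline.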
10
Article
Charity·23w
Moving from WordPress to Substack
A developer announces their migration from WordPress to Substack after a decade of blogging. The move is motivated by frustration with WordPress and the desire to join the more vibrant tech writing community on Substack. The author is working on the second edition of "Observability Engineering" and plans to share insights from that process. Email subscribers are being migrated, but comments cannot be transferred, and the original site will remain accessible to preserve existing links.
60
8
11
Article
Coralogix·1y
Using the OpenTelemetry Operator to boost your observability
Discover how the OpenTelemetry Operator simplifies observability in Kubernetes by auto-collecting trace data and integrating with Coralogix using Helm charts. This setup requires no code changes, offering easy auto-instrumentation for supported languages, enabling trace data visualization and analysis without extensive configuration.
58
12
Article
OpenTelemetry·1y
OpenTelemetry Logging and You
OpenTelemetry provides a comprehensive logging framework that includes logs, events, and spans. Logs are any telemetry data emitted through a log pipeline via the Logs API, while events are a specific type of log with a defined schema. Spans differ from events by having durations and hierarchical relationships. The design emphasizes correlating all telemetry signals through context for a cohesive observability solution.
58
13
Article
Go Developers·1y
Golang + observability = backend applications that you'll love ❤️
Yokai is a simple, modular, and observable Go framework designed to help developers avoid boilerplate code in production-grade backend applications. It includes features such as logs, traces, metrics, health checks, and config management. The framework offers demo applications and tutorials to assist developers in getting started.
58
14
Article
Hacker News·27w
The Grafana trust problem
An experienced engineer shares their journey with Grafana's observability stack, detailing how frequent architectural changes, deprecations, and increasing complexity have eroded trust. Starting with simple Loki/Prometheus setups, they've witnessed rapid product churn—Grafana Agent deprecated within 2-3 years, OnCall discontinued, and Mimir 3.0 now requiring Kafka. The constant restructuring, incompatibilities with Prometheus Operator standards, and career-driven development pace make it difficult to maintain stable monitoring infrastructure. While acknowledging the technical quality of Grafana products, the author questions their long-term viability and considers alternatives like the kube-prometheus-stack with Thanos.
57
4
15
Article
Grafana Labs·27w
A Star Wars dashboard deep dive: How to build your next visualization in less than 12 parsecs
A detailed walkthrough of building a Star Wars-themed Grafana dashboard, covering practical techniques like using stat panels for custom text styling, TestData plugin for simulating dynamic data, canvas panels for creating custom visualizations with animations, and styling approaches for visual consistency. Demonstrates how to create gauges, charts, maps, and custom layouts while explaining the technical implementation behind each component.
57
16
Article
Grafana Labs·27w
Understand, diagnose, and optimize SQL queries: Introducing Grafana Cloud Database Observability
Grafana Cloud Database Observability is now in public preview, offering developers, SREs, and DBAs tools to understand, diagnose, and optimize SQL queries. The solution addresses the visibility gap in database performance by providing query-level insights, execution plans, wait event analysis, and AI-powered optimization suggestions. It supports MySQL and PostgreSQL, integrates with Grafana Alloy for telemetry collection, and correlates database metrics with application and infrastructure data for comprehensive system-wide performance analysis.
51
17
Article
CNCF·24w
Building microservices the easy way with Dapr
Dapr is a CNCF graduated project that simplifies microservices development by providing a sidecar runtime that handles distributed system concerns like messaging, pub-sub, service communication, storage, and secrets management. Built with observability in mind, Dapr automatically propagates traces and metrics across asynchronous and synchronous systems without requiring manual instrumentation. Recent additions include workflow orchestration, AI/LLM integration through a Conversation API, and Dapr Agents for durable autonomous workflows. The project was open source from inception, joined CNCF as an incubating project in 2021, and graduated in October 2024 with thousands of contributors from hundreds of organizations.
50
18
Article
ByteByteGo·30w
How Nubank Built an In-house Logging Platform for 1 Trillion Log Entries
Nubank built an in-house logging platform to replace a costly third-party vendor, handling 1 trillion daily log entries at 50% lower cost. The solution uses a two-phase architecture: an ingestion pipeline with Fluent Bit, custom buffering, and processing services, plus a query/storage layer combining Trino, AWS S3, and Parquet format. The platform processes 1 petabyte daily, maintains 45 petabytes of searchable data with 45-day retention, and serves 15,000 queries daily scanning 150 petabytes. Key design decisions included decoupling ingestion from querying, implementing micro-batching for reliability, and achieving 95% data compression with Parquet.
49
19
Article
Grafana Labs·29w
Grafana Mimir 3.0 release: performance improvements, a new query engine, and more
Grafana Mimir 3.0 introduces a redesigned architecture that separates read and write operations using Apache Kafka as an asynchronous buffer, eliminating performance bottlenecks between ingestion and queries. The release features the Mimir Query Engine (MQE), which processes queries in a streaming fashion rather than bulk loading, reducing peak memory usage by up to 92%. These improvements deliver 15% lower resource usage in large clusters while maintaining faster query execution and higher reliability. The new ingest storage component ensures query spikes won't slow down data ingestion and vice versa, enabling independent scaling of each path.
47
1
20
Article
Platformatic·35w
Open Source Node.js Command Center Released
Platformatic has open-sourced their Intelligent Command Center (ICC), a specialized tool designed to optimize Node.js applications running in Kubernetes. The platform addresses common issues with traditional autoscaling by monitoring Node.js-specific metrics like Event Loop Utilization and heap usage instead of generic CPU/memory metrics. ICC provides predictive autoscaling, comprehensive observability, advanced caching infrastructure, and can reduce resource usage by up to 30% while improving performance. The tool integrates with existing Kubernetes infrastructure and includes features like flamegraph profiling, cache analytics, and unified dashboards for operational visibility.
47
21
Article
Last9·1y
How to Set Up Logging in Node.js (Without Overthinking It)
Proper logging in Node.js is critical for debugging, performance insights, security, and operational intelligence. The post outlines the importance of logging, recommends tools like Winston, Pino, and Morgan, and provides a guide to setting up and utilizing these libraries effectively. It also discusses advanced logging strategies, error handling, and integrations with visualization and monitoring tools like Last9, Kibana, and Grafana.
47
2
22
Article
Buildkite·30w
Kubernetes with Buildkite: faster, simpler, and ready for scale
Buildkite has updated its Kubernetes Agent Stack. Installation now requires only a single agent token instead of multiple configuration parameters, and scaling handles tens of thousands of concurrent jobs with 80% smaller Kubernetes objects. Errors surface with full YAML specs and stack-level failure signals, Prometheus integration works out of the box for instant observability dashboards, and Helm configuration options have been expanded. Future improvements include custom scheduling policies, more granular job states, and fine-grained job configuration controls.
45
23
Article
InfluxData·27w
Introducing the New Cloud Dedicated Admin UI
InfluxData has released a major update to the Cloud Dedicated Admin UI, introducing live cluster observability dashboards with CPU, memory, and request rate metrics. The update includes redesigned navigation for quick access to databases, tables, and tokens, plus enhanced table schema browsing with column type filtering. Users can now monitor cluster performance across different time periods and switch between multiple accounts and clusters directly from the interface.
42
1
24
Article
Grafana Labs·1y
New in Grafana 12: Dynamic dashboards that are smarter, easier to edit, and can be customized for teams
Grafana 12 introduces dynamic dashboards designed to enhance ease of navigation, editability, and customization. Key features include tabs for segmenting data by context, a dashboard outline for quick navigation, and conditional rendering to reduce visual clutter. The release also features a new schema inspired by Kubernetes CRD to improve dashboard management, and a context-aware editing interface to streamline user workflows. These enhancements aim to address scalability challenges in large organizations and improve overall user experience.
42
25
Article
Milan Jovanović·48w
Monitoring .NET Applications with OpenTelemetry and Grafana
Learn how to implement comprehensive observability for .NET applications using OpenTelemetry and Grafana Cloud. The guide covers installing OpenTelemetry packages, configuring automatic instrumentation for ASP.NET Core, Entity Framework, and other libraries, setting up OTLP export to Grafana Cloud, and viewing traces and logs in unified dashboards. This setup provides distributed tracing, log correlation, and monitoring capabilities that scale from single services to complex microservice architectures.
40
