Best of Distributed Systems · September 2025

  1.
    Article
    Metadata · 37w

    Disaggregation: A New Architecture for Cloud Databases

    Disaggregated database architecture separates compute and storage into independent, scalable components to better exploit cloud elasticity. This approach addresses the asymmetry between expensive, fluctuating compute resources and cheaper, stable storage. Modern systems like Snowflake and Aurora demonstrate this pattern, with newer implementations pushing disaggregation further into specialized services. While disaggregation enables better resource utilization and cost optimization, it introduces performance tradeoffs due to network communication overhead. The architecture also opens opportunities to rethink distributed protocols and enables new capabilities like real-time HTAP systems and specialized hardware adoption.

  2.
    Article
    Netflix TechBlog · 34w

    Building a Resilient Data Platform with Write-Ahead Log at Netflix

    Netflix built a generic Write-Ahead Log (WAL) system to solve data consistency and reliability challenges at scale. The system provides a simple API that abstracts underlying message queues (Kafka, SQS) and supports multiple use cases including delayed queues, cross-region replication, and multi-partition mutations. WAL prevents data loss, handles system entropy across different datastores, and enables reliable retry mechanisms for real-time data pipelines. The architecture separates message producers from consumers, uses configurable namespaces for logical separation, and leverages Netflix's Data Gateway infrastructure for deployment. Key applications include EVCache cross-region replication, Live Origin's delayed delete operations, and Key-Value service's MutateItems API with two-phase commit semantics.

  3.
    Article
    Javarevisited · 36w

    How I Combined ByteByteGo and Codemia.io to Crack System Design Interviews in 2025

    A developer shares their successful strategy for system design interview preparation by combining ByteByteGo's visual learning approach with Codemia.io's hands-on practice platform. ByteByteGo provided conceptual understanding through industry-leading diagrams of real systems like Twitter and Uber, while Codemia.io offered practical experience with actual interview problems and AI-driven feedback. This dual approach helped the author secure multiple FAANG offers by mastering both theoretical concepts and practical application skills.

  4.
    Article
    Hacker News · 35w

    Why Local-First Apps Haven’t Become Popular?

    Local-first apps promise instant loading and privacy but remain uncommon due to synchronization challenges. Building offline-capable applications creates distributed systems where multiple devices modify data independently, requiring solutions for unreliable event ordering and data conflicts. Hybrid Logical Clocks (HLCs) solve ordering issues by combining physical and logical timestamps, while Conflict-Free Replicated Data Types (CRDTs) handle conflicts through strategies like Last-Write-Wins. SQLite serves as an ideal foundation for local-first architectures, enabling reliable offline functionality through message-based synchronization that guarantees eventual consistency across devices.
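    The two ideas in this summary can be sketched together. The following is a minimal illustration, not the article's implementation: the class names, the injected `now` clock, and the use of a node id as a final tiebreaker are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Timestamp:
    wall: int      # physical component, e.g. milliseconds since epoch
    logical: int   # counter that breaks ties within one wall value
    node: str      # node id (assumed tiebreaker so the order is total)


class HybridLogicalClock:
    """Hypothetical HLC: physical time plus a logical counter."""

    def __init__(self, node, now):
        self.node = node
        self.now = now      # injected physical clock returning an int
        self.wall = 0
        self.logical = 0

    def tick(self):
        """Timestamp a local event or an outgoing message."""
        physical = self.now()
        if physical > self.wall:
            self.wall, self.logical = physical, 0
        else:
            # Physical clock stalled or went backwards: bump the counter.
            self.logical += 1
        return Timestamp(self.wall, self.logical, self.node)

    def receive(self, remote):
        """Merge a remote timestamp so later local events sort after it."""
        physical = self.now()
        wall = max(self.wall, remote.wall, physical)
        if wall == self.wall and wall == remote.wall:
            logical = max(self.logical, remote.logical) + 1
        elif wall == self.wall:
            logical = self.logical + 1
        elif wall == remote.wall:
            logical = remote.logical + 1
        else:
            logical = 0
        self.wall, self.logical = wall, logical
        return Timestamp(wall, logical, self.node)


class LWWRegister:
    """Last-Write-Wins register: the write with the largest timestamp survives."""

    def __init__(self):
        self.value = None
        self.stamp = None

    def apply(self, value, stamp):
        if self.stamp is None or stamp > self.stamp:
            self.value, self.stamp = value, stamp
```

    Because `apply` keeps only the largest timestamp, two replicas that see the same writes in different orders end up with the same value, which is the convergence property the summary attributes to CRDTs.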

  5.
    Article
    System Design Newsletter · 34w

    How Kafka Works

    Apache Kafka is a distributed, fault-tolerant pub/sub messaging system built on a simple log data structure. It uses brokers for horizontal scaling, partitions for data sharding, and replication for durability. The system employs KRaft consensus for leader election and metadata management. Key features include tiered storage for cost optimization, consumer groups for parallel processing, transactions for exactly-once semantics, and ecosystem components like Kafka Streams for stream processing and Kafka Connect for system integration.
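    To make the "simple log data structure" and "partitions for data sharding" concrete, here is a toy in-memory sketch. It is not Kafka's code or client API: `PartitionedLog` and its methods are hypothetical names, and CRC32 is used only as a stand-in for whatever hash a real partitioner applies to the key.

```python
import zlib


class PartitionedLog:
    """Toy model: each partition is an append-only list addressed by offset."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stand-in hash; the point is that one key always maps to one partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        """Append a record and return its (partition, offset)."""
        partition = self.partition_for(key)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1

    def read(self, partition, offset, max_records=10):
        """Read records starting at an offset; reading never removes data."""
        return self.partitions[partition][offset:offset + max_records]
```

    Records with the same key land in the same partition and keep their append order, and a consumer only needs to remember an offset per partition to resume where it left off.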

  6.
    Article
    Oxide · 35w

    Systems Software in the Large

    Systems software development becomes exponentially more challenging when scaled to large, multi-component projects. The intersection of systems programming complexity with large-scale software development creates some of the most difficult engineering problems. Oxide's software update system exemplifies this challenge, requiring dynamic updates of distributed systems while maintaining operability and working across air-gapped environments. The post highlights lessons learned from leading such projects, including managing scope creep, organizational procrastination, and technical decision-making in complex systems.

  7.
    Article
    System Design Codex · 33w

    How Amazon S3 Works Behind the Scenes

    Amazon S3 processes millions of requests per second and stores over 350 trillion objects using a microservices architecture. The system consists of five main layers: front-end services for request handling and authentication, metadata services for object indexing, storage services using erasure coding across multiple availability zones, durability services with checksums and auditing, and security services with IAM policies and encryption. This modular approach enables independent scaling and updates while achieving 11 nines (99.999999999%) of durability.

  8.
    Article
    P99 Conf · 35w

    Books by P99 CONF Speakers: AI Engineering, Latency, Distributed Systems & More

    A curated collection of technical books authored by P99 CONF speakers covering AI engineering, distributed systems, database performance optimization, latency reduction, and PostgreSQL query optimization. The books range from foundational concepts to advanced implementation techniques, with special discounts available through conference sponsors O'Reilly and Manning Publications.

  9.
    Article
    .NET Blog · 34w

    Announcing Aspire 9.5

    Aspire 9.5 introduces a preview 'aspire update' command for automatic upgrades, single-file AppHost support that eliminates the need for project files, enhanced dashboard with multi-resource console logs and GenAI visualizer, and new integrations for OpenAI, Azure Dev Tunnels, and YARP static file serving. The release focuses on simplifying the developer experience for building distributed applications.

  10.
    Article
    InfoWorld · 35w

    Advanced debug logging techniques: A technical guide

    Debug logging is essential for maintaining high-performance applications across different architectures. Effective debug logging requires being selective about what to log, using structured formats like JSON, including contextual information such as correlation IDs, and implementing techniques like parameterized logging and rate limiting. Key practices include avoiding over-logging, never logging sensitive data, maintaining consistent formatting, and using centralized log management platforms. The guide covers specific tools for different languages (Winston for Node.js, structlog for Python, SLF4J for Java) and emphasizes the importance of correlation IDs for distributed tracing in microservice environments.
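    A minimal sketch of three of the practices named above (JSON output, a correlation ID on every record, and parameterized messages). It uses only Python's standard `logging` module rather than structlog, and the names `correlation_id`, `CorrelationFilter`, `JsonFormatter`, and `build_logger` are assumptions made for the example.

```python
import contextvars
import json
import logging

# Holds the ID of the request currently being handled (hypothetical name).
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": record.correlation_id,
        })


def build_logger(name, stream):
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(CorrelationFilter())
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger
```

    A caller would set the ID once per request with `correlation_id.set("req-42")` and then log with placeholders, for example `log.debug("loaded %d items for user %s", 3, "u1")`. Passing arguments separately lets the logger skip string formatting when the debug level is disabled.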

  11.
    Article
    Hacker News · 36w

    pgEdge goes Open Source

    pgEdge has relicensed all core components of their Distributed Postgres platform from a proprietary pgEdge Community License to the permissive PostgreSQL License. This includes their replication engine Spock and extensions like Snowflake and Lolor. The change makes pgEdge's distributed PostgreSQL technology fully open source, allowing unrestricted use and modification of the source code.

  12.
    Article
    Netflix TechBlog · 34w

    100X Faster: How We Supercharged Netflix Maestro’s Workflow Engine

    Netflix redesigned their Maestro workflow orchestrator engine, achieving 100x performance improvement by replacing the stateless worker model with a stateful actor-based architecture using Java virtual threads. The new design reduces overhead from seconds to milliseconds, maintains in-memory state for better locality, implements strong execution guarantees, and simplifies the architecture by removing dependencies on external distributed queues and multiple databases.

  13.
    Video
    ByteByteGo · 35w

    FAANG System Design Interview: Design A Chat System (WhatsApp, Facebook Messenger, Discord, Slack)

    A comprehensive guide to designing a scalable chat system like WhatsApp or Discord. Covers core architecture decisions including HTTP for sending messages and WebSocket for receiving, the inbox pattern for offline message delivery, service discovery for routing between chat servers, fan-out patterns for group messaging, and presence tracking with heartbeats. Discusses scaling challenges from thousands to millions of users, including connection bottlenecks, database sharding, and multi-region deployment.

  14.
    Article
    Milan Jovanović · 35w

    Distributed Locking in .NET: Coordinating Work Across Multiple Instances

    Distributed locking solves coordination problems when applications run across multiple instances. While .NET provides concurrency primitives for single processes, distributed systems need specialized solutions to prevent race conditions and ensure data consistency. PostgreSQL advisory locks offer a simple DIY approach, while the DistributedLock library provides production-ready features with support for multiple backends including Postgres, Redis, and SQL Server.
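    The PostgreSQL advisory-lock approach can be sketched as follows. This is an illustration in Python, not the article's .NET code: the helper names are hypothetical, the cursor is assumed to be a DB-API cursor that accepts `%s` placeholders (as psycopg does), and hashing the lock name to a signed 64-bit key is one possible convention. `pg_try_advisory_lock` and `pg_advisory_unlock` are PostgreSQL functions that take a `bigint` key.

```python
import hashlib
import struct


def advisory_key(name):
    """Map a lock name to a signed 64-bit integer (assumed convention)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return struct.unpack(">q", digest[:8])[0]


def try_lock(cursor, name):
    """Try to take the lock without blocking; True if this session now holds it."""
    cursor.execute("SELECT pg_try_advisory_lock(%s)", (advisory_key(name),))
    return bool(cursor.fetchone()[0])


def unlock(cursor, name):
    """Release the lock; must run on the same session that acquired it."""
    cursor.execute("SELECT pg_advisory_unlock(%s)", (advisory_key(name),))
    return bool(cursor.fetchone()[0])
```

    These locks are session-scoped, so the acquire and release calls have to go through the same database connection, and an instance that fails to get the lock should skip the work rather than wait.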

  15.
    Article
    Google Developers · 36w

    A2A Extensions: Empowering Custom Agent Functionality

    A2A Extensions enable developers to add custom functionalities to agent-to-agent communication beyond the core A2A protocol. Extensions are declared in Agent Cards and identified by unique URIs, creating an open ecosystem. Real-world implementations include traceability extensions for debugging agent interactions, Twilio's latency-aware extensions for voice agents, and Identity Machines' zero-trust handshakes for secure task delegation.