Best of Distributed Systems · November 2025

  1. Article
    Javarevisited · 29w

    6 Must-Read Books for Backend Developers in 2026

    A curated list of six essential books for backend developers covering software architecture, design patterns, distributed systems, microservices, and data engineering. The recommendations include classics like "Designing Data-Intensive Applications" by Martin Kleppmann, "The Pragmatic Programmer," and "Building Microservices" by Sam Newman, focusing on fundamental principles that remain relevant despite changing frameworks and tools. Each book addresses critical aspects of backend development from API design and scalability to data pipelines and architectural trade-offs.

  2. Article
    Docker · 25w

    You Want Microservices—But Do You Need Them?

    Microservices have become the default architectural choice despite their significant complexity costs. Amazon Prime Video reported a 90% cost reduction after moving one of its monitoring services back to a monolith, while companies like Twilio Segment and Shopify found success with simpler architectures. Industry leaders including GitHub's former CTO and GraphQL's co-creator warn that most organizations lack the scale to justify microservices overhead. The operational costs, developer productivity drain, testing complexity, and data consistency challenges often outweigh the benefits. Modular monoliths and service-oriented architectures offer comparable scalability without distributed system complexity. Docker provides deployment consistency across any architecture, not just microservices. The key question: do your actual scale and team structure justify the microservices premium, or are you choosing complexity over business needs?

  3. Article
    InfoQ · 26w

    Stripe's Zero-Downtime Data Movement Platform Migrates Petabytes with Millisecond Traffic Switches

    Stripe developed a Zero-Downtime Data Movement Platform that migrates petabyte-scale databases with traffic switches completing in milliseconds to 2 seconds. The system handles 5 million queries per second across 2,000+ MongoDB shards using a six-phase process: migration registration, bulk import (10x faster through B-tree-optimized inserts), bidirectional async replication, validation, versioned gating for traffic cutover, and cleanup. The platform enables horizontal scaling, shard merging, version upgrades, and tenancy transitions while maintaining 99.9995% reliability for $1.4 trillion in annual transactions.

  4. Video
    CodeHead · 26w

    10 Concepts EVERY Backend Dev Should Know

    Covers 10 fundamental backend development concepts including authentication vs authorization, rate limiting, database indexes, ACID transactions, caching strategies, message queues, load balancing, CAP theorem, reverse proxies, and CDNs. Explains how each concept solves real-world problems like security, performance, scalability, and reliability in production systems.

  5. Article
    Marc Brooker · 26w

    Why Strong Consistency?

    Eventual consistency in database architectures creates significant challenges for both application developers and end users. Common issues include race conditions where newly created resources appear to not exist, complex retry logic requirements, and limitations on read replica effectiveness for read-modify-write operations. Aurora DSQL addresses these problems by providing strongly consistent reads across all replicas while maintaining read scalability, eliminating the need for applications to handle replication lag and routing complexity.
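
The race described above, where a newly created resource appears not to exist, can be reproduced with a toy model. This is a minimal sketch under stated assumptions: `Primary`, `LaggingReplica`, and `read_with_retry` are hypothetical names invented for illustration, replication is modeled as an explicit `catch_up()` call, and nothing here reflects how Aurora DSQL is implemented.

```python
class Primary:
    """Accepts writes and keeps an ordered log for replicas to apply."""

    def __init__(self):
        self.data = {}
        self.log = []  # ordered (key, value) writes

    def put(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class LaggingReplica:
    """Eventually consistent replica: only sees writes after catch_up()."""

    def __init__(self, primary):
        self.primary = primary
        self.data = {}
        self.applied = 0  # number of log entries applied so far

    def catch_up(self):
        for key, value in self.primary.log[self.applied:]:
            self.data[key] = value
        self.applied = len(self.primary.log)

    def get(self, key):
        return self.data.get(key)


def read_with_retry(replica, key, attempts=3):
    """The retry logic an application needs when reads may be stale."""
    for _ in range(attempts):
        value = replica.get(key)
        if value is not None:
            return value
        replica.catch_up()  # stand-in for waiting out replication lag
    return None


primary = Primary()
replica = LaggingReplica(primary)

primary.put("order-1", "created")
stale_read = replica.get("order-1")            # replica has not applied the write yet
retried_read = read_with_retry(replica, "order-1")
```

With strongly consistent reads, the `read_with_retry` wrapper and any routing of read-modify-write traffic to the primary would be unnecessary, which is the simplification the article argues for.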

  6. Article
    Hacker News · 28w

    Modular Monolith and Microservices: Modularity is what truly matters

    Modularity is the fundamental principle in software architecture, independent of whether you choose a monolith or microservices. The article explores five implementation strategies ranging from simple modular monoliths (modules as folders) to full microservices, emphasizing that good module separation based on domain understanding should drive architectural decisions, not the other way around. Key insight: start simple with a modular monolith and only increase complexity when justified by specific needs like resource optimization or team scaling. The author advocates for constrained microservices (microliths) that prohibit synchronous inter-service calls during request handling, reducing distributed system complexity while maintaining deployment independence.

  7. Article
    Metadata · 27w

    Disaggregated Database Management Systems

    Explores how cloud trends are reshaping database architecture through disaggregation—separating compute, storage, and memory into independently scalable components. Examines three case studies: Google AlloyDB (PostgreSQL with compute-storage separation and HTAP support), Rockset (real-time analytics using the Aggregator-Leaf-Tailer pattern), and Nova-LSM (LSM-based storage with immutable SSTs in object stores). Discusses emerging hardware disaggregation including RDMA-based memory systems, CXL coherent memory fabrics, and DPU-based approaches. Highlights open challenges around automatic workload-driven assembly, co-design across fabrics, correctness verification, and adaptive reconfiguration.

  8. Article
    System Design Codex · 28w

    Key Concepts of Kafka

    Kafka is a distributed event store and streaming platform that has become essential for large-scale data pipelines at companies like Netflix and Uber. The core architecture consists of messages organized into topics and partitions, with producers writing data and consumers reading it in groups. Brokers form clusters that handle message storage and replication for reliability. Key advantages include support for multiple producers and consumers, disk-based retention for durability, and horizontal scalability. However, challenges include complex configuration options, inconsistent tooling, limited client library maturity outside Java/C, and lack of true multi-tenancy.

  9. Article
    Architecture Weekly · 27w

    Requeuing Roulette in Event-Driven Architecture and Messaging

    Explores the "Requeuing Roulette" pattern in event-driven systems, where messages are put back into queues in the hope that they will eventually be processed in the correct order. While this technique can work when messages aren't causally correlated and consumers are stable, it creates risks under load: messages may be reprocessed out of order, causing race conditions and wasted CPU. The pattern attempts to maintain strict ordering while maximizing throughput, but this trade-off often fails in distributed systems. Better alternatives include using message grouping features (RabbitMQ routing keys, SQS message groups, Service Bus sessions) or streaming solutions like Kafka that handle ordering through partitions. Understanding actual ordering requirements and choosing simpler solutions typically beats trying to make requeuing work reliably.
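
The ordering hazard can be shown with a small in-memory simulation. This is a simplified sketch, not any broker's actual API: `process_with_requeue` and `process_grouped` are hypothetical helpers, messages are `(group_key, event)` tuples, and a "failure" is modeled as a message that fails exactly once.

```python
from collections import deque


def process_with_requeue(messages, fails_once):
    """Requeuing roulette: a failed message goes to the back of the queue."""
    queue = deque(messages)
    failed_already = set()
    processed = []
    while queue:
        msg = queue.popleft()
        if msg in fails_once and msg not in failed_already:
            failed_already.add(msg)
            queue.append(msg)  # later messages can now overtake it
            continue
        processed.append(msg)
    return processed


def process_grouped(messages, fails_once):
    """Per-group FIFO: a failed head is retried before the rest of its group."""
    groups = {}
    for msg in messages:
        groups.setdefault(msg[0], deque()).append(msg)
    failed_already = set()
    processed = []
    for group_queue in groups.values():
        while group_queue:
            msg = group_queue[0]
            if msg in fails_once and msg not in failed_already:
                failed_already.add(msg)
                continue  # retry the same head; the group stays in order
            processed.append(group_queue.popleft())
    return processed


messages = [
    ("order-1", "created"),
    ("order-1", "paid"),
    ("order-2", "created"),
]
transient_failures = {("order-1", "created")}

roulette_order = process_with_requeue(messages, transient_failures)
grouped_order = process_grouped(messages, transient_failures)
```

In the requeue run, `("order-1", "paid")` is handled before `("order-1", "created")`; in the grouped run, each order's events keep their original sequence at the cost of blocking that one group during the retry.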

  10. Article
    System Design Codex · 26w

    An Intro to DB Sharding

    Database sharding is a horizontal scaling strategy that partitions data across multiple servers to improve query performance and handle larger workloads. The guide covers three main sharding strategies: key-based (using hash functions for even distribution), range-based (partitioning by value ranges), and directory-based (using lookup tables). While sharding offers benefits like improved performance and reliability, it introduces complexity, potential data imbalance, and limitations on cross-shard operations. The article emphasizes that sharding should be a last resort after exhausting simpler options like indexing and replication, and provides guidance on choosing strategies based on read/write patterns and avoiding common pitfalls.
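
The three strategies named above can each be reduced to a small routing function. The sketch below is illustrative only: the function names, the choice of SHA-256, and the example boundaries are assumptions, not taken from the article.

```python
import bisect
import hashlib


def key_based_shard(key, num_shards):
    """Key-based: hash the key so rows spread evenly across shards."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def range_based_shard(value, upper_bounds):
    """Range-based: upper_bounds is a sorted list of exclusive upper bounds.

    Values at or above the last bound fall into one final shard.
    """
    return bisect.bisect_right(upper_bounds, value)


def directory_based_shard(key, directory):
    """Directory-based: an explicit lookup table maps each key to a shard."""
    return directory[key]


# Usage: route the same logical records three different ways.
hash_shard = key_based_shard("user-42", num_shards=4)
range_shard = range_based_shard(150, upper_bounds=[100, 200])   # 100..199 -> shard 1
lookup_shard = directory_based_shard("tenant-a", {"tenant-a": 2, "tenant-b": 0})
```

The trade-offs show up directly in the code: the hash version cannot serve range scans from one shard, the range version can develop hot ranges, and the directory version adds a lookup that must itself be kept available and consistent.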

  11. Article
    Baeldung · 25w

    Temporal Workflow Engine with Spring Boot

    Temporal is a workflow engine that enables resilient, deterministic execution of business processes. The Spring Boot integration simplifies setup through automatic workflow and activity registration, declarative worker queue configuration, and auto-configured WorkflowClient beans. The tutorial demonstrates building an order processing workflow with parallel execution, external event handling, timeouts, and failure recovery. Key implementation patterns include using Async.function() for parallel branches, Workflow.await() for blocking on conditions, and signal/query methods for external interaction. The integration supports easy switching between local development, in-memory testing, and production environments through configuration properties.

  12. Video
    ByteByteGo · 27w

    System Design: Why is Kafka Popular?

    Kafka enables companies like LinkedIn, Netflix, and Uber to handle billions of messages daily through its distributed log architecture. It decouples services by allowing producers and consumers to communicate asynchronously, absorbs traffic spikes, and enables event replay for debugging. Messages are written to append-only partitions organized into topics across broker clusters. Key features include consumer groups for parallel processing, replication for durability, and three delivery guarantees (at-most-once, at-least-once, exactly-once). Partitioning strategy is critical: poor key selection creates hot partitions, while compound keys distribute load effectively. Trade-offs include added operational complexity, a design that favors throughput over latency, and ordering guarantees limited to single partitions. Event sourcing patterns use Kafka as the source of truth by appending state changes as events.
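
The hot-partition effect is easy to see with a stand-in partitioner. This sketch assumes a generic hash-modulo partitioner built on SHA-256 (Kafka's default partitioner uses a different hash), and the key formats and the `partition_for` and `partition_load` helpers are hypothetical.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 6


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Stand-in partitioner: hash the key, then take it modulo the partition count."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def partition_load(keys):
    """Count how many messages land on each partition."""
    return Counter(partition_for(k) for k in keys)


# One busy tenant keyed only by tenant ID: every message hits the same partition.
simple_load = partition_load(["tenant-1"] * 1000)

# Compound key (tenant plus order ID): the same traffic spreads out.
compound_load = partition_load([f"tenant-1:order-{i}" for i in range(1000)])
```

The compound key gives up per-tenant ordering: messages are only ordered within a single partition, so ordering now holds per order rather than per tenant.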

  13. Article
    Flipkart Tech · 28w

    When Good Locks Go Bad: Diagnosing a System Meltdown Under Load

    Engineers at Flipkart diagnosed a critical system failure during load testing for their Big Billion Days sale. Their Mirana service crashed under load due to excessive contention on a Redis distributed lock. Initial solutions using queuing failed because they violated the 'fail fast' principle. The team ultimately solved the problem by implementing an AtomicInteger-based semaphore to limit concurrent threads attempting lock acquisition. The key insight was optimizing for actual service performance (200-300ms per request) rather than downstream resource limits, reducing allowed concurrency from 128 to 5 threads per pod and achieving stable throughput of 90 QPS across 9 pods.
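
A fail-fast concurrency cap of the kind described can be sketched as follows. This is a Python analog under stated assumptions, not Flipkart's code: the article's version uses a Java `AtomicInteger`, whereas here a `threading.Lock` around a plain counter stands in for the atomic operations, and `FailFastLimiter`, `Overloaded`, and `handle_request` are hypothetical names.

```python
import threading


class Overloaded(Exception):
    """Raised immediately instead of queuing the caller."""


class FailFastLimiter:
    """Caps how many threads may contend for the distributed lock at once."""

    def __init__(self, max_concurrent):
        self._max = max_concurrent
        self._in_flight = 0
        self._guard = threading.Lock()  # stands in for AtomicInteger atomicity

    def try_enter(self):
        with self._guard:
            if self._in_flight >= self._max:
                return False  # reject right away; never block or queue
            self._in_flight += 1
            return True

    def leave(self):
        with self._guard:
            self._in_flight -= 1


def handle_request(limiter, acquire_lock_and_work):
    """Run the lock-protected work only if a permit is free."""
    if not limiter.try_enter():
        raise Overloaded("too many threads contending for the lock")
    try:
        return acquire_lock_and_work()
    finally:
        limiter.leave()  # always release the permit, even on error


limiter = FailFastLimiter(max_concurrent=5)  # 5 per pod, as in the article
result = handle_request(limiter, lambda: "processed")
```

Rejecting excess callers up front keeps latency bounded for the requests that are admitted, which is the property the queuing-based attempts lost.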