Best of Distributed Systems — November 2025

1
Article
Javarevisited·29w
6 Must-Read Books for Backend Developers in 2026
A curated list of six essential books for backend developers covering software architecture, design patterns, distributed systems, microservices, and data engineering. The recommendations include classics like "Designing Data-Intensive Applications" by Martin Kleppmann, "The Pragmatic Programmer," and "Building Microservices" by Sam Newman, focusing on fundamental principles that remain relevant despite changing frameworks and tools. Each book addresses critical aspects of backend development from API design and scalability to data pipelines and architectural trade-offs.
353
2
2
Article
Docker·25w
You Want Microservices—But Do You Need Them?
Microservices have become the default architectural choice despite their significant complexity costs. Amazon Prime Video cut costs for one of its services by 90% by moving it to a monolith, while companies like Twilio Segment and Shopify found success with simpler architectures. Industry leaders including GitHub's former CTO and GraphQL's co-creator warn that most organizations lack the scale to justify microservices overhead. The operational costs, developer productivity drain, testing complexity, and data consistency challenges often outweigh the benefits. Modular monoliths and service-oriented architectures offer comparable scalability without distributed system complexity. Docker provides deployment consistency across any architecture, not just microservices. The key question: does your actual scale and team structure justify the microservices premium, or are you choosing complexity over business needs?
149
11
3
Article
InfoQ·26w
Stripe's Zero-Downtime Data Movement Platform Migrates Petabytes with Millisecond Traffic Switches
Stripe developed a Zero-Downtime Data Movement Platform that migrates petabyte-scale databases with traffic switches that take between milliseconds and 2 seconds. The system handles 5 million queries per second across 2,000+ MongoDB shards using a six-phase process: migration registration, bulk import (10x faster through B-tree-optimized inserts), bidirectional async replication, validation, versioned gating for traffic cutover, and cleanup. The platform enables horizontal scaling, shard merging, version upgrades, and tenancy transitions while maintaining 99.9995% reliability for $1.4 trillion in annual transactions.
44
4
Video
CodeHead·26w
10 Concepts EVERY Backend Dev Should Know
Covers 10 fundamental backend development concepts including authentication vs authorization, rate limiting, database indexes, ACID transactions, caching strategies, message queues, load balancing, CAP theorem, reverse proxies, and CDNs. Explains how each concept solves real-world problems like security, performance, scalability, and reliability in production systems.
30
5
Article
Marc Brooker·26w
Why Strong Consistency?
Eventual consistency in database architectures creates significant challenges for both application developers and end users. Common issues include race conditions where newly created resources appear not to exist, complex retry logic requirements, and limits on how useful read replicas are for read-modify-write operations. Aurora DSQL addresses these problems by providing strongly consistent reads across all replicas while maintaining read scalability, eliminating the need for applications to handle replication lag and routing complexity.
31
2
6
Article
Hacker News·28w
Modular Monolith and Microservices: Modularity is what truly matters
Modularity is the fundamental principle in software architecture, independent of whether you choose a monolith or microservices. The article explores five implementation strategies ranging from simple modular monoliths (modules as folders) to full microservices, emphasizing that good module separation based on domain understanding should drive architectural decisions, not the other way around. Key insight: start simple with a modular monolith and only increase complexity when justified by specific needs like resource optimization or team scaling. The author advocates for constrained microservices (microliths) that prohibit synchronous inter-service calls during request handling, reducing distributed system complexity while maintaining deployment independence.
28
1
7
Article
Metadata·27w
Disaggregated Database Management Systems
Explores how cloud trends are reshaping database architecture through disaggregation—separating compute, storage, and memory into independently scalable components. Examines three case studies: Google AlloyDB (PostgreSQL with compute-storage separation and HTAP support), Rockset (real-time analytics using the Aggregator-Leaf-Tailer pattern), and Nova-LSM (LSM-based storage with immutable SSTs in object stores). Discusses emerging hardware disaggregation including RDMA-based memory systems, CXL coherent memory fabrics, and DPU-based approaches. Highlights open challenges around automatic workload-driven assembly, co-design across fabrics, correctness verification, and adaptive reconfiguration.
27
8
Article
System Design Codex·28w
Key Concepts of Kafka
Kafka is a distributed event store and streaming platform that has become essential for large-scale data pipelines at companies like Netflix and Uber. The core architecture consists of messages organized into topics and partitions, with producers writing data and consumers reading it in groups. Brokers form clusters that handle message storage and replication for reliability. Key advantages include support for multiple producers and consumers, disk-based retention for durability, and horizontal scalability. However, challenges include complex configuration options, inconsistent tooling, limited client library maturity outside Java/C, and lack of true multi-tenancy.
25
9
Article
Architecture Weekly·27w
Requeuing Roulette in Event-Driven Architecture and Messaging
Explores the "Requeuing Roulette" pattern in event-driven systems, where messages are put back into queues in the hope that they will eventually be processed in the correct order. While this technique can work when messages aren't causally correlated and consumers are stable, it creates risks under load: messages may be reprocessed out of order, causing race conditions and CPU waste. The pattern attempts to maintain strict ordering while maximizing throughput, but this trade-off often fails in distributed systems. Better alternatives include using message grouping features (RabbitMQ routing keys, SQS message groups, Service Bus sessions) or streaming solutions like Kafka that handle ordering through partitions. Understanding actual ordering requirements and choosing simpler solutions typically beats trying to make requeuing work reliably.
20
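The ordering risk described in the requeuing entry above can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions: the function names, the `(group, sequence)` message shape, and the "fails once" failure model are all hypothetical, and no real broker API (RabbitMQ, SQS, Service Bus) is used.

```python
from collections import deque


def consume_with_requeue(messages, fails_once):
    """Requeue a failed message at the tail of one shared queue (hypothetical model)."""
    queue = deque(messages)
    failed_already = set()
    processed = []
    while queue:
        msg = queue.popleft()
        if msg in fails_once and msg not in failed_already:
            failed_already.add(msg)
            queue.append(msg)  # back of the queue: later messages overtake it
            continue
        processed.append(msg)
    return processed


def consume_grouped(messages, fails_once):
    """Keep one FIFO per group and retry the head in place (hypothetical model)."""
    groups = {}
    for msg in messages:
        groups.setdefault(msg[0], deque()).append(msg)
    failed_already = set()
    processed = []
    # Groups are drained one after another here for simplicity;
    # a real broker would typically serve different groups in parallel.
    for group_queue in groups.values():
        while group_queue:
            msg = group_queue[0]
            if msg in fails_once and msg not in failed_already:
                failed_already.add(msg)
                continue  # retry the same head; nothing in this group overtakes it
            processed.append(group_queue.popleft())
    return processed


messages = [("order-1", 1), ("order-1", 2), ("order-2", 1)]
fails_once = {("order-1", 1)}

requeued = consume_with_requeue(messages, fails_once)
grouped = consume_grouped(messages, fails_once)
```

In this model the requeue variant processes `("order-1", 2)` before `("order-1", 1)`, while the grouped variant keeps each group's messages in sequence at the cost of blocking that group during the retry.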
10
Article
System Design Codex·26w
An Intro to DB Sharding
Database sharding is a horizontal scaling strategy that partitions data across multiple servers to improve query performance and handle larger workloads. The guide covers three main sharding strategies: key-based (using hash functions for even distribution), range-based (partitioning by value ranges), and directory-based (using lookup tables). While sharding offers benefits like improved performance and reliability, it introduces complexity, potential data imbalance, and limitations on cross-shard operations. The article emphasizes that sharding should be a last resort after exhausting simpler options like indexing and replication, and provides guidance on choosing strategies based on read/write patterns and avoiding common pitfalls.
12
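The three strategies named in the sharding entry above can be sketched as three small routing functions. This is a minimal illustration, not the article's code: the shard count, range boundaries, directory contents, and function names are all assumptions made for the example.

```python
import bisect
import hashlib

NUM_SHARDS = 4  # hypothetical shard count


def key_based_shard(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Key-based: hash the key and take it modulo the shard count."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Range-based: shard i holds values below RANGE_UPPER_BOUNDS[i];
# anything at or above the last bound goes to the final shard.
RANGE_UPPER_BOUNDS = [1000, 2000, 3000]  # hypothetical boundaries


def range_based_shard(value: int) -> int:
    """Range-based: find the first range whose upper bound exceeds the value."""
    return bisect.bisect_right(RANGE_UPPER_BOUNDS, value)


# Directory-based: an explicit lookup table maps each key to its shard.
DIRECTORY = {"tenant-a": 0, "tenant-b": 2}  # hypothetical assignments


def directory_based_shard(key: str) -> int:
    """Directory-based: look the key up; unknown keys raise KeyError."""
    return DIRECTORY[key]
```

Note that the plain modulo in `key_based_shard` remaps most keys when the shard count changes, which is one reason resharding is costly with this scheme.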
11
Article
Baeldung·25w
Temporal Workflow Engine with Spring Boot
Temporal is a workflow engine that enables resilient, deterministic execution of business processes. The Spring Boot integration simplifies setup through automatic workflow and activity registration, declarative worker queue configuration, and auto-configured WorkflowClient beans. The tutorial demonstrates building an order processing workflow with parallel execution, external event handling, timeouts, and failure recovery. Key implementation patterns include using Async.function() for parallel branches, Workflow.await() for blocking on conditions, and signal/query methods for external interaction. The integration supports easy switching between local development, in-memory testing, and production environments through configuration properties.
10
12
Video
ByteByteGo·27w
System Design: Why is Kafka Popular?
Kafka enables companies like LinkedIn, Netflix, and Uber to handle billions of messages daily through its distributed log architecture. It decouples services by allowing producers and consumers to communicate asynchronously, absorbs traffic spikes, and enables event replay for debugging. Messages are written to append-only partitions organized into topics across broker clusters. Key features include consumer groups for parallel processing, replication for durability, and three delivery guarantees (at-most-once, at-least-once, exactly-once). Partitioning strategy is critical: poor key selection creates hot partitions, while compound keys distribute load effectively. Trade-offs include added operational complexity, a design optimized for throughput rather than latency, and ordering guarantees limited to single partitions. Event sourcing patterns use Kafka as the source of truth by appending state changes as events.
10
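The point about hot partitions and compound keys in the Kafka entry above can be made concrete with a toy partitioner. This is a sketch under loud assumptions: `zlib.crc32` stands in for the hash (Kafka's own default partitioner uses a different hash function), and the partition count, event data, and key format are invented for the example.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 6  # hypothetical partition count


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition; crc32 is a stand-in hash, not Kafka's."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical skewed traffic: 90% of events come from one country.
events = [("US", f"user-{i}") for i in range(900)]
events += [("NZ", f"user-{i}") for i in range(100)]

# Keying by country alone sends every "US" event to the same partition.
by_country = Counter(partition_for(country) for country, _ in events)

# A compound key (country plus user) spreads the same events out.
by_compound = Counter(
    partition_for(f"{country}:{user}") for country, user in events
)
```

The trade-off is in ordering: with the compound key, events are ordered only per country-and-user pair, no longer per country, because ordering holds only within a single partition.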
13
Article
Flipkart Tech·28w
When Good Locks Go Bad: Diagnosing a System Meltdown Under Load
Engineers at Flipkart diagnosed a critical system failure during load testing for their Big Billion Days sale. Their Mirana service crashed under load due to excessive contention on a Redis distributed lock. Initial solutions using queuing failed because they violated the 'fail fast' principle. The team ultimately solved the problem by implementing an AtomicInteger-based semaphore to limit concurrent threads attempting lock acquisition. The key insight was optimizing for actual service performance (200-300ms per request) rather than downstream resource limits, reducing allowed concurrency from 128 to 5 threads per pod and achieving stable throughput of 90 QPS across 9 pods.
10
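The fail-fast concurrency cap described in the Flipkart entry above can be sketched as follows. The summary says the team used a Java `AtomicInteger`; this is a Python analogue that guards a plain counter with a `threading.Lock` instead, and the class and function names are hypothetical rather than taken from the article.

```python
import threading


class FailFastLimiter:
    """Caps how many threads may attempt the expensive step at once."""

    def __init__(self, max_concurrent: int) -> None:
        self._max = max_concurrent
        self._in_flight = 0
        self._guard = threading.Lock()  # protects the counter only

    def try_enter(self) -> bool:
        """Take a slot if one is free; never wait for one."""
        with self._guard:
            if self._in_flight >= self._max:
                return False  # fail fast instead of queuing
            self._in_flight += 1
            return True

    def leave(self) -> None:
        """Release a slot taken by a successful try_enter()."""
        with self._guard:
            self._in_flight -= 1


def handle_request(limiter: FailFastLimiter, acquire_lock_and_work):
    """Reject immediately when the cap is reached; otherwise run the work."""
    if not limiter.try_enter():
        return "rejected"
    try:
        return acquire_lock_and_work()
    finally:
        limiter.leave()
```

With `FailFastLimiter(5)`, matching the per-pod limit quoted in the summary, a sixth concurrent caller gets `"rejected"` right away rather than piling onto the distributed lock.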
