Best of Distributed Systems — 2025

1
Article
Javarevisited·1y
System Design CheatSheet for Interview
This post provides a comprehensive cheatsheet of essential system design concepts commonly covered in interviews. Topics include REST API, networking, OAuth & JWT, cookies vs sessions, CI/CD workflows, Kafka, various databases, system testing, Git, Docker, Kubernetes, design patterns, logging, load balancing, and more. It's aimed at helping readers quickly revise these concepts before an interview.
1.2K
19
2
Article
Tech World With Milan·1y
How does Netflix manage to show you a movie without interruptions?
Netflix delivers buffer-free streaming through a sophisticated distributed systems architecture. The platform uses Amazon Web Services for managing control-plane operations and its custom Content Delivery Network, Open Connect, to handle data-plane operations. Key components include hundreds of microservices, a two-tier CDN deployment, adaptive bitrate streaming, and advanced resilience engineering practices. This setup allows for smooth content delivery and high availability, even under heavy load.
434
10
3
Article
System Design Codex·1y
8 Must-Know Distributed System Design Patterns
Distributed systems are crucial for scalability, fault tolerance, and high availability but pose challenges such as state management, failure handling, and communication. Key design patterns like Ambassador Pattern, Circuit Breaker Pattern, CQRS, Sharding, Sidecar Pattern, Pub/Sub Pattern, Leader Election, and Event Sourcing help address these challenges by offloading tasks, preventing cascading failures, separating read/writes, partitioning data, decoupling concerns, enabling async communication, managing shared resources, and capturing state changes as events.
379
6
4
Article
Javarevisited·29w
6 Must-Read Books for Backend Developers in 2026
A curated list of six essential books for backend developers covering software architecture, design patterns, distributed systems, microservices, and data engineering. The recommendations include classics like "Designing Data-Intensive Applications" by Martin Kleppmann, "The Pragmatic Programmer," and "Building Microservices" by Sam Newman, focusing on fundamental principles that remain relevant despite changing frameworks and tools. Each book addresses critical aspects of backend development from API design and scalability to data pipelines and architectural trade-offs.
353
2
5
Article
Medium·1y
10 System Design Concepts You Must Master Before Your Next SDE Interview (with Resources)
Preparing for system design interviews, especially for roles at big tech companies, requires mastering key concepts like web fundamentals, core components of large-scale systems, databases, caching, messaging and queuing systems, system communication, scalability, security, high availability, and fault tolerance. Practical knowledge and examples, such as designing an event notification system or Netflix architecture, are also crucial. Detailed resources and guides are recommended for in-depth understanding and effective preparation.
342
4
6
Article
Materialized View·51w
Kafka: The End of the Beginning
Apache Kafka has dominated streaming data for over a decade, but innovation has stagnated while batch processing has evolved rapidly. The streaming ecosystem faces challenges with slow growth, long sales cycles, and lack of new ideas. While Kafka's protocol has become the de facto standard, its architecture shows limitations for modern cloud-native requirements. New solutions like S2 are emerging with fresh approaches, and the next decade could see a transition similar to how batch processing moved beyond Hadoop, potentially ushering in a truly cloud-native streaming era.
293
6
7
Article
Tech World With Milan·48w
What I learned from the book Designing Data-Intensive Applications
A comprehensive review of Martin Kleppmann's "Designing Data-Intensive Applications" after two complete readings. The book provides foundational knowledge about distributed systems, covering reliability, scalability, and maintainability principles. Key topics include data models (relational vs document vs graph), storage engines (B-trees vs LSM-trees), replication strategies, partitioning, transactions, and stream processing. The review highlights the book's strengths in explaining trade-offs and connecting theory to practice, while noting limitations like outdated examples and dense theoretical content. Recommended for experienced engineers working with data-intensive systems.
250
6
8
Article
System Design Codex·44w
Must-Know Event-Driven Architectural Patterns
Seven essential event-driven architectural patterns are explored: Competing Consumer for scaling workloads, Asynchronous Task Execution for decoupled processing, Consume and Project for read-optimized views, Saga for distributed transactions, Event Aggregation for combining events, Event Sourcing for complete audit trails, and Transactional Outbox for atomic database and event operations. Each pattern addresses specific challenges in building resilient, scalable event-driven systems with practical examples and implementation considerations.
212
9
Article
Milan Jovanović·1y
Understanding Microservices: Core Concepts and Benefits
Microservices are independently deployable services centered around business domains, offering flexibility, adaptability, and targeted scaling. They enable parallel development, technology diversity, and organizational alignment but introduce challenges like distributed system complexity, operational overhead, and data consistency issues. Effective microservices adoption often starts small and evolves over time, focusing on the most beneficial parts of the existing architecture.
196
2
10
Video
ByteByteGo·47w
7 System Design Concepts Explained in 10 Minutes
Seven fundamental concepts power reliable distributed systems: CAP theorem forces choosing between consistency and availability during network partitions, eventual consistency enables high performance through delayed convergence, load balancers distribute traffic using Layer 4 or Layer 7 strategies, consistent hashing minimizes data movement when scaling nodes, circuit breakers prevent cascade failures by blocking requests to failing services, rate limiting protects against overload using token bucket or sliding window algorithms, and monitoring provides visibility through metrics, logs, traces, and events to maintain system health.
195
3
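As a rough illustration of the token bucket approach to rate limiting named in the summary above, here is a minimal sketch. The class name, the parameters, and the choice to pass the clock in explicitly (so the behaviour is easy to follow) are assumptions for illustration, not taken from the video.

```python
class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum tokens the bucket can hold
        self.tokens = capacity    # start with a full bucket
        self.last = now           # timestamp of the most recent refill

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill in proportion to the time elapsed, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        # Spend tokens if enough are available; otherwise reject the request.
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

In a real service the `now` argument would typically come from a monotonic clock such as `time.monotonic()`.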
11
Video
The Coding Gopher·1y
99% of Developers Don't Get RPCs
RPC, or Remote Procedure Call, is a critical communication protocol in distributed systems, allowing for code execution on remote systems as if they were local. This method abstracts networking complexities, making it ideal for microservices and internal systems that require efficiency and strict contracts. Unlike REST, which uses HTTP verbs and is better for external APIs, RPC offers granular function-level control, better performance with binary formats like Protobuf, and advanced capabilities like streaming and retries. gRPC enhances these benefits with efficient communication and built-in logging and metrics, making it a superior choice for modern backend architectures.
181
5
12
Article
Three Dots Labs·50w
Event Driven Architecture: The Hard Parts
Event-driven architecture offers powerful benefits like scaling and decoupling but comes with significant challenges. Key issues include debugging async systems without proper observability, handling eventual consistency, preventing message loss through the outbox pattern, and designing events that avoid tight coupling. The architecture requires idempotent handlers to manage duplicate message delivery, proper dead letter queue handling, and careful consideration of message ordering. While EDA can solve real problems, it adds complexity that isn't always justified; sometimes synchronous systems or monoliths are better choices.
174
1
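To make the idea of an idempotent handler from the summary above concrete, the sketch below skips messages whose ID has already been processed. The `IdempotentHandler` name and the in-memory set of seen IDs are assumptions for illustration; a production system would more likely keep processed IDs in durable storage, updated together with the handler's side effects.

```python
from typing import Any, Callable


class IdempotentHandler:
    """Wrap a handler so that redelivered messages are processed only once."""

    def __init__(self, handler: Callable[[Any], None]):
        self._handler = handler
        self._seen: set[str] = set()  # hypothetical in-memory dedup store

    def handle(self, message_id: str, payload: Any) -> bool:
        # Duplicate delivery: acknowledge without repeating the side effect.
        if message_id in self._seen:
            return False
        self._handler(payload)
        # Record the ID only after the handler succeeds, so a failed
        # attempt can be retried on redelivery.
        self._seen.add(message_id)
        return True
```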
13
Article
Hacker News·22w
Your Logs Are Lying To You
Traditional logging practices fail in modern distributed systems because they produce fragmented, context-poor log lines that are difficult to search and correlate. The solution is "wide events" (also called canonical log lines): emitting one comprehensive, structured event per request per service that contains all relevant context—user data, business metrics, infrastructure details, and error information. This approach transforms debugging from text searching into structured querying, enabling complex questions to be answered with simple SQL-like queries. Key implementation strategies include building events throughout the request lifecycle, using tail-based sampling to keep all errors while sampling successful requests, and deliberately instrumenting code with business context rather than relying on auto-instrumentation alone.
154
3
14
Article
Docker·25w
You Want Microservices—But Do You Need Them?
Microservices have become the default architectural choice despite their significant complexity costs. Amazon Prime Video achieved 90% cost reduction by reverting to a monolith, while companies like Twilio Segment and Shopify found success with simpler architectures. Industry leaders including GitHub's former CTO and GraphQL's co-creator warn that most organizations lack the scale to justify microservices overhead. The operational costs, developer productivity drain, testing complexity, and data consistency challenges often outweigh benefits. Modular monoliths and service-oriented architectures offer comparable scalability without distributed system complexity. Docker provides deployment consistency across any architecture, not just microservices. The key question: does your actual scale and team structure justify the microservices premium, or are you choosing complexity over business needs?
149
11
15
Article
Community Picks·1y
Redis Deep Dive for System Design Interviews
Redis is a versatile and simple tool ideal for system design interviews due to its diverse capabilities and ease of understanding. It supports various data structures and communication patterns, making it suitable for high-speed caching, distributed locking, rate limiting, and proximity searches. However, because it is primarily an in-memory store, its durability guarantees are weaker than those of a disk-backed database, which requires careful consideration in design decisions.
146
1
16
Article
Medium·47w
Why We Replaced Kafka with gRPC for Service Communication
A development team replaced Kafka with gRPC for synchronous service communication in their loan servicing platform after experiencing issues with debugging, latency, and operational complexity. While keeping Kafka for appropriate use cases like audit logs and fan-out patterns, they found gRPC provided better performance (70-80% latency reduction), easier debugging, and simpler infrastructure management for request-response interactions. The key lesson was using each tool for its intended purpose rather than forcing one solution everywhere.
135
7
17
Article
System Design Codex·39w
A Quick Guide to RabbitMQ
RabbitMQ is a message broker that enables asynchronous communication between applications by acting as a middleman. Messages flow from producers to exchanges, which route them to queues based on bindings and routing keys, where consumers can process them. The system supports different exchange types (direct, topic, fanout) for various routing patterns, providing decoupling, scalability, and reliability for distributed systems.
125
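The producer, exchange, binding, and queue flow described in the RabbitMQ summary above can be modelled in a few lines. This is a toy in-memory model, not the RabbitMQ client API: the `Exchange` class and its methods are hypothetical, and only direct and fanout routing are covered (topic pattern matching is left out).

```python
from collections import deque


class Exchange:
    """Toy exchange that routes messages to bound queues."""

    def __init__(self, kind: str):
        if kind not in ("direct", "fanout"):
            raise ValueError(f"unsupported exchange type: {kind}")
        self.kind = kind
        self.bindings: list[tuple[str, deque]] = []  # (binding key, queue)

    def bind(self, queue: deque, binding_key: str = "") -> None:
        self.bindings.append((binding_key, queue))

    def publish(self, routing_key: str, message: str) -> int:
        """Deliver to matching queues and return how many received the message."""
        delivered = 0
        for binding_key, queue in self.bindings:
            # Fanout ignores keys; direct requires an exact key match.
            if self.kind == "fanout" or binding_key == routing_key:
                queue.append(message)
                delivered += 1
        return delivered
```

A consumer in this model would simply call `popleft()` on its queue, which is where the decoupling between producers and consumers comes from.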
18
Article
Javarevisited·38w
How ByteByteGo Makes System Design Easy for Visual Learners?
ByteByteGo excels at teaching system design through visual-first learning, using clear diagrams and step-by-step breakdowns to explain complex concepts like caching, load balancing, and distributed systems. The platform offers consistent visual materials across books, videos, and courses, featuring real-world case studies of systems like YouTube, Twitter, and Uber. Visual learners benefit from the diagram-driven approach that transforms abstract concepts into clear, memorable mental maps, making it particularly effective for technical interview preparation.
123
19
Article
ByteByteGo·1y
EP160: Top 20 System Design Concepts You Should Know
Discover essential system design concepts such as load balancing, caching, and database sharding, which are crucial for building scalable and reliable systems. Learn about key elements like the CAP theorem and message queues, which help in creating robust distributed architectures.
122
1
20
Article
ByteByteGo·32w
How Flipkart Built a Highly Available MySQL Cluster for 150+ Million Users
Flipkart built Altair, an internally managed MySQL service that maintains high availability for 150+ million daily users through automated failover and primary-replica architecture. The system uses a three-layered monitoring approach (agent, monitor, orchestrator) to detect failures, prevent false positives, and execute failovers with minimal data loss. Altair prioritizes write availability over strong consistency using asynchronous replication, implements DNS-based service discovery for seamless failovers, and includes multiple safeguards against split-brain scenarios. The design balances operational simplicity with reliability, achieving near five-nines availability while managing thousands of database clusters across Flipkart's microservices infrastructure.
120
21
Article
Metadata·37w
Disaggregation: A New Architecture for Cloud Databases
Disaggregated database architecture separates compute and storage into independent, scalable components to better exploit cloud elasticity. This approach addresses the asymmetry between expensive, fluctuating compute resources and cheaper, stable storage. Modern systems like Snowflake and Aurora demonstrate this pattern, with newer implementations pushing disaggregation further into specialized services. While disaggregation enables better resource utilization and cost optimization, it introduces performance tradeoffs due to network communication overhead. The architecture also opens opportunities to rethink distributed protocols and enables new capabilities like real-time HTAP systems and specialized hardware adoption.
111
5
22
Article
ByteByteGo·1y
How Netflix Orchestrates Millions of Workflow Jobs with Maestro
Netflix transitioned from using the Meson orchestrator to Maestro due to scalability issues with the growing volume of data and workflows. Maestro, built with a distributed microservices architecture, efficiently manages large-scale workflows with high reliability and low operational overhead. It supports dynamic workflows, defined via DSLs, a visual UI, or programmatic APIs, and leverages technologies such as CockroachDB and distributed queues. Features like event publishing, parameterized workflows, and an integrated signal service enable Maestro to handle extensive data processing and machine learning tasks at scale.
107
23
Article
Salesforce Engineering·29w
Architecting Multi-System Production Platform
Salesforce built Digital Wallet, a consumption-based pricing platform serving 15,000+ organizations and generating $400M+ in annual contract value. The engineering team overcame significant challenges as Data Cloud's first customer, including implementing SOX-compliant metadata security through Strict System Mode, building a custom event subscriber processing 20M daily events, and architecting failover strategies for near real-time usage tracking. The platform integrates multiple systems using fan-out mechanisms for entitlement sync, implements Spark job failover between EMR-on-EKS and EMR-on-EC2 to avoid rate limits, and maintains billing accuracy through architectural separation of hourly customer-facing updates from monthly financial reconciliation. The system includes high-cardinality monitoring, automatic retry logic, and a month-long buffer for usage reconciliation before billing.
106
24
Article
ByteByteGo·50w
EP166: What is Event Sourcing?
Event sourcing is a design paradigm that stores events leading to state changes rather than current state data, providing determinism and recoverability. The approach uses an append-only event store with sequenced events to rebuild application state. The newsletter also covers software deployment pipelines, data lake architecture, Netflix's distributed counter implementation, and TCP handshake mechanics.
93
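The append-only event store with sequenced events described in the summary above can be sketched as follows. The bank-balance domain, the event names, and the `EventStore` and `replay` helpers are assumptions chosen for illustration, not details from the newsletter.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Event:
    seq: int     # position in the log, assigned on append
    kind: str    # "deposited" or "withdrawn" (hypothetical event names)
    amount: int


@dataclass
class EventStore:
    _events: list = field(default_factory=list)

    def append(self, kind: str, amount: int) -> Event:
        # Events are only ever added to the end, never updated or deleted.
        event = Event(seq=len(self._events) + 1, kind=kind, amount=amount)
        self._events.append(event)
        return event

    def read(self, up_to_seq: Optional[int] = None) -> list:
        return [e for e in self._events if up_to_seq is None or e.seq <= up_to_seq]


def replay(events) -> int:
    """Rebuild the current balance by applying events in sequence order."""
    balance = 0
    for e in sorted(events, key=lambda ev: ev.seq):
        if e.kind == "deposited":
            balance += e.amount
        elif e.kind == "withdrawn":
            balance -= e.amount
        else:
            raise ValueError(f"unknown event kind: {e.kind}")
    return balance
```

Replaying a prefix of the log, for example `replay(store.read(up_to_seq=2))`, yields the state as it was after the second event, which is the recoverability property the summary refers to.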
25
Article
ElixirStatus·1y
Overengineered #001: Hello World
The post explores building an overengineered 'Hello World' system using Elixir. It demonstrates creating a distributed system where multiple nodes automatically discover each other and send 'hello world' messages to newly joined nodes. The setup involves using GenServers, UDP broadcast for node discovery, and handling node greetings with the Greeter module. The project is aimed at learning and fun, showcasing an extensive approach to a simple problem.
92
1
