Best of Distributed Systems — January 2025

1
Article
Javarevisited·1y
System Design CheatSheet for Interview
This post provides a comprehensive cheatsheet of essential system design concepts commonly covered in interviews. Topics include REST API, networking, OAuth & JWT, cookies vs sessions, CI/CD workflows, Kafka, various databases, system testing, Git, Docker, Kubernetes, design patterns, logging, load balancing, and more. It's aimed at helping readers quickly revise these concepts before an interview.
1.2K
19
2
Article
System Design Codex·1y
8 Must-Know Distributed System Design Patterns
Distributed systems are crucial for scalability, fault tolerance, and high availability, but they pose challenges such as state management, failure handling, and communication. Eight design patterns help address these challenges: Ambassador (offloading tasks), Circuit Breaker (preventing cascading failures), CQRS (separating reads from writes), Sharding (partitioning data), Sidecar (decoupling concerns), Pub/Sub (enabling asynchronous communication), Leader Election (managing shared resources), and Event Sourcing (capturing state changes as events).
379
6
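As a rough illustration of how a circuit breaker prevents cascading failures, the sketch below stops calling a failing dependency after a few consecutive errors and allows a single trial call once a cool-down has passed. It is not taken from the linked article; the names `CircuitBreaker`, `CircuitOpenError`, `max_failures`, and `reset_after` are hypothetical, and the thresholds are arbitrary.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    """Hypothetical minimal breaker: closed -> open -> half-open -> closed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: fail fast instead of hitting the dependency.
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: half-open, so one trial call is let through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_profile, user_id)`, and treat `CircuitOpenError` as a signal to serve a fallback rather than wait on a dependency that is known to be down.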
3
Article
swizec.com·1y
Why you need a task queue
Task queues allow code to defer work, improving the performance and reliability of systems, especially during high load or vendor outages. Servers process requests either one by one or concurrently, and different setups present unique challenges. By leveraging queuing theory, systems can manage load effectively, and task queues can help by deferring and buffering tasks, handling API rate limits, and recovering from partial failures.
75
2
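To make the idea of deferring and buffering work concrete, here is a minimal in-process sketch: the caller enqueues a task and returns immediately, while a background worker processes tasks and re-enqueues failures a bounded number of times. It is an illustration under stated assumptions, not the article's code; `enqueue`, `worker`, `start_worker`, `dead_letters`, and `max_attempts` are hypothetical names, and a production system would typically use a durable queue rather than an in-memory one.

```python
import queue
import threading


def enqueue(tasks, payload):
    """Defer work: record the task (attempt 1) and return right away."""
    tasks.put((payload, 1))


def worker(tasks, handler, dead_letters, max_attempts=3):
    """Process tasks until a None sentinel arrives, retrying failures."""
    while True:
        item = tasks.get()
        try:
            if item is None:
                return
            payload, attempt = item
            try:
                handler(payload)
            except Exception:
                if attempt < max_attempts:
                    # Re-enqueue before task_done() so join() keeps waiting.
                    tasks.put((payload, attempt + 1))
                else:
                    # Give up and park the task for later inspection.
                    dead_letters.append(payload)
        finally:
            tasks.task_done()


def start_worker(tasks, handler, dead_letters, max_attempts=3):
    """Run the worker on a background thread and return the thread."""
    thread = threading.Thread(
        target=worker,
        args=(tasks, handler, dead_letters, max_attempts),
        daemon=True,
    )
    thread.start()
    return thread
```

With this shape, a request handler calls `enqueue(tasks, payload)` and responds without waiting on the slow or flaky vendor call; `tasks.join()` blocks until every buffered task has either succeeded or landed in `dead_letters`.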
4
Article
ByteByteGo·1y
How Airbnb Built a Key-Value Store for Petabytes of Data
Airbnb developed a key-value store named Mussel to handle petabytes of derived data with high reliability, availability, and low latency. Mussel's architecture leverages sharding, Kafka for replication, and HRegion for unified real-time and batch data storage. This system supports efficient bulk loading and offers impressive performance metrics, including over 99.9% availability and sub-8 millisecond read latency. Mussel overcame the limitations of previous solutions like HFileService and Nebula by automating shard management with Apache Helix and using Spark for incremental bulk data loads.
64
5
Article
Community Picks·1y
Papers We Love
Papers We Love is a repository and community dedicated to academic computer science papers. It features various chapters worldwide and organizes meetups discussing topics such as neural networks for detecting epileptic attacks, the Exponential Time Hypothesis, Named Data Networking, and serverless frameworks.
38
6
Article
Collections·1y
Every System is a Log: Avoiding Coordination in Distributed Applications
Building resilient distributed applications can be challenging due to failovers, retries, and coordination. Treating every system as a log simplifies state management and coordination across system components like databases, message queues, and locking services. The open-source project Restate embodies this concept by using logs to manage state and coordination, providing a practical solution for creating robust workflows and ensuring consistent operations in distributed systems.
35
7
Article
Faun·1y
Introduction Guide to RPC in Golang
RPC (Remote Procedure Call) simplifies communication between services by allowing a procedure to be executed on a remote machine as if it were a local function call. This guide explains the need for RPC in place of traditional API calls in distributed systems, provides an example of a key-value store implementation, and outlines server and client design in Golang using the rpc package. It also covers the structure of RPC functions and the practical steps to run both the server and client.
29
8
Article
Metadata·1y
Use of Time in Distributed Databases (part 5): Lessons learned
Exploring the pivotal role of synchronized time in distributed databases for performance optimization and alignment, this piece discusses how systems like Spanner, CockroachDB, and DynamoDB use time for consistent decision-making, conflict detection, and fencing mechanisms. The trend towards advanced time-based techniques and speculation is highlighted, emphasizing future research in time synchronization precision and isolation guarantees.
19
9
Article
Flipkart Tech·1y
Real-Time Data Propagation with HBase: Exploring Change Data Capture and Its Challenges
Change Data Capture (CDC) in HBase tracks and captures data changes in real time and makes them available to other systems. HBase, a distributed non-relational data store, uses its Write Ahead Log (WAL) to implement CDC. This process supports business use cases at Flipkart such as ad campaigns and e-commerce transactions. The post discusses the architecture, the two methods of data propagation (Mutation Based and Cell Based Change Propagation), the filters applied, and the challenges encountered in using these methods, providing insights into efficient data tracking and propagation.
16
10
Article
Collections·1y
Understanding Apache Kafka: Basics and Key Features
Apache Kafka is a distributed event-streaming platform designed for real-time data processing. It manages data flow efficiently in event-driven systems with components like topics, partitions, producers, consumers, and brokers. Kafka ensures high availability through data replication and a leader-follower model. Its architecture supports data persistence and parallel processing via consumer groups. The recent introduction of Kafka Raft (KRaft) aims to simplify cluster management.
13
