Best of Distributed Systems — November 2024

1
Article
Community Picks·1y
How Distributed Systems Avoid Race Conditions using Pessimistic Locking
Pessimistic locking prevents race conditions in distributed systems by ensuring that only one process can access shared data at a time. A cluster-wide lock database manages the locks, and leases release a lock automatically if the node holding it fails. Fence tokens add a further safeguard against stale updates: writes from nodes with out-of-date tokens are rejected.
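As a rough illustration of the lease and fence-token idea summarized above, here is a minimal in-memory sketch. It is not taken from the linked article: `LockService`, `Storage`, and their methods are hypothetical names, the "lock database" is a single Python object rather than a replicated store, and the clock is passed in explicitly so the behavior is easy to follow.

```python
class LockService:
    """Hypothetical stand-in for a cluster-wide lock database (assumption)."""

    def __init__(self, lease_seconds: float) -> None:
        self.lease_seconds = lease_seconds
        self._holder = None
        self._expires_at = 0.0
        self._next_token = 0

    def acquire(self, node: str, now: float):
        """Return a new fence token, or None while another lease is still live."""
        if self._holder is not None and now < self._expires_at:
            return None
        # The lock is free or the previous lease expired: grant a new lease
        # and hand out a strictly larger token than any issued before.
        self._holder = node
        self._expires_at = now + self.lease_seconds
        self._next_token += 1
        return self._next_token


class Storage:
    """Hypothetical shared resource that checks the token on every write."""

    def __init__(self) -> None:
        self.value = None
        self._highest_token = 0

    def write(self, value, token: int) -> bool:
        # Reject writes whose token is older than one already seen.
        if token < self._highest_token:
            return False
        self._highest_token = token
        self.value = value
        return True


locks = LockService(lease_seconds=10)
store = Storage()

token_a = locks.acquire("node-a", now=0)     # node-a gets the lock
blocked = locks.acquire("node-b", now=5)     # lease still live, so None
token_b = locks.acquire("node-b", now=11)    # lease expired, node-b takes over

accepted = store.write("from node-b", token_b)
stale = store.write("from node-a", token_a)  # node-a wakes up late, rejected
```

In this sketch the token check lives in the storage layer on purpose: a node that paused past its lease cannot know its lock is gone, so the resource being written to has to be the one that refuses the stale write.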
86
2
Article
gitconnected·1y
Data Centers in System Design
A data center is a building full of servers, storage systems, and network equipment. Multi-data center architectures improve reliability and speed by using GeoDNS to direct each user to the nearest center. Failover is handled by automatic failure detection and routing updates. Scaling strategies include decoupling components into microservices and adopting messaging architectures built on message queues. Companies like Netflix and Amazon apply these concepts for redundancy and scalability.
59
3
Article
Foojay.io·1y
Task Schedulers in Java: Modern Alternatives to Quartz Scheduler
Quartz has been a long-standing job scheduling library in Java, but several modern alternatives like JobRunr and db-scheduler offer more developer-friendly APIs, better performance, and enhanced support for distributed environments. JobRunr stands out for its ease of use and built-in dashboard, while db-scheduler is appreciated for its simpler configuration. For broader workflow management, solutions like Temporal and Kestra are noteworthy for their resilient and low-code features.
52
2
4
Article
System Design Codex·1y
How Consistent Hashing Works?
Consistent hashing is a technique used for distributing keys uniformly across a cluster of nodes, minimizing the number of keys that need to be moved when nodes are added or removed. Steps include hashing keys and nodes using a hash function, placing them on a circular space or ring, and assigning keys to the nearest node in a clockwise direction. Virtual nodes help with load balancing by mapping physical nodes to multiple positions on the ring. This makes the technique scalable, load-balanced, and fault-tolerant, though it relies heavily on the quality of the hash function used.
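The steps in this summary can be sketched in a few lines of Python. This is an illustrative sketch rather than code from the linked article: the `HashRing` class, the `ring_position` helper, the choice of SHA-256, and the default of 100 virtual nodes per physical node are all assumptions made for the example.

```python
import bisect
import hashlib


def ring_position(label: str) -> int:
    """Map a key or node label to a point on the ring (64-bit integer)."""
    digest = hashlib.sha256(label.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes: int = 100) -> None:
        self.vnodes = vnodes
        self._positions = []  # sorted ring positions
        self._owner = {}      # ring position -> physical node
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each physical node is placed at several positions on the ring.
        for i in range(self.vnodes):
            pos = ring_position(f"{node}#{i}")
            if pos in self._owner:
                continue  # position already taken; skip the rare collision
            bisect.insort(self._positions, pos)
            self._owner[pos] = node

    def remove_node(self, node: str) -> None:
        self._positions = [p for p in self._positions if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def node_for(self, key: str) -> str:
        if not self._positions:
            raise LookupError("ring is empty")
        # First node position at or after the key, wrapping around the ring.
        idx = bisect.bisect_left(self._positions, ring_position(key))
        return self._owner[self._positions[idx % len(self._positions)]]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

After `node-d` joins, every key in `moved` now belongs to `node-d`; keys never shuffle between the three original nodes, which is the property that keeps rebalancing cheap.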
50
5
Article
swizec.com·2y
Why software only moves forward
Software systems, especially at scale, cannot afford rollbacks or cut-overs and must always move forward due to the permanent nature of data. Data, once saved, must be managed forever, requiring updates to be additive and systems to be distributed. Challenges arise as different parts of the system need to operate on shared definitions of business logic, leading to complexities during updates. Key strategies include making additive changes, being permissive about inputs, and managing updates to both databases and code to ensure systems remain in sync.
49
1
6
Article
Architecture Weekly·1y
Deduplication in Distributed Systems: Myths, Realities, and Practical Solutions
Duplication in distributed systems is a common issue due to retries, processing failures, and fault tolerance mechanisms. Deduplication aims to identify and eliminate duplicate messages, but it comes with challenges that impact scalability, performance, and reliability. The post explores how deduplication is implemented in technologies like Kafka and RabbitMQ, and discusses the trade-offs and complexities involved. It also highlights the concept of exactly-once processing as a more realistic goal than exactly-once delivery, emphasizing patterns like idempotency and transactional outboxes to achieve robust message handling.
39
7
Article
The Polymathic Engineer·2y
The fallacies of distributed systems
This post discusses the eight fallacies of distributed systems identified by L. Peter Deutsch and others at Sun Microsystems. They are the false assumptions that the network is reliable, latency is zero, bandwidth is infinite, the network is secure, topology does not change, there is one administrator, transport cost is zero, and the network is homogeneous. Recognizing these assumptions and mitigating their consequences is crucial to designing robust distributed systems. The post recommends strategies such as retransmission mechanisms, caching, data compression, and security measures.
38
1
8
Article
Netflix TechBlog·2y
Netflix’s Distributed Counter Abstraction
Netflix's Distributed Counter Abstraction is a high-performance, scalable counting service built on top of their TimeSeries Abstraction. It supports two primary counting modes—Best-Effort and Eventually Consistent—to cater to different use cases and trade-offs involving accuracy, latency, and infrastructure costs. The service aims to handle high throughput and availability by leveraging a combination of caching, durable queuing, and periodic aggregation mechanisms. Additionally, it incorporates various approaches to manage potential data loss, idempotency, and contention issues inherent in distributed systems.
35
9
Article
Cerbos·2y
How to address decentralized data management in microservices
Moving from a monolithic to a microservices architecture brings both benefits and challenges for decentralized data management. The post discusses advantages such as scalability, flexibility, performance, and fault isolation, alongside challenges such as complex data integration, increased development complexity, latency, and security risks. It details patterns and techniques that mitigate these challenges: eventual consistency, the Saga pattern, event sourcing, domain-driven design (DDD), and command query responsibility segregation (CQRS). A case study from Uber shows how these methods are applied in practice to maintain data integrity and system reliability.
30
11
10
Article
swizec.com·2y
Why you need observability more than tests
Friday deploys can be daunting, but effective observability helps teams identify and resolve issues quickly. Unlike tests, which can miss production-specific problems, observability provides real-time insight through centralized error logging and alerts. In the post's example, this approach enabled a rapid response to a SQL error that followed an update, highlighting the importance of default instrumentation, easy log addition, and self-serve alert creation for maintaining system stability.
27
11
Article
ITNEXT·2y
Storage Disaggregated Databases and Shared Transaction Log Architecture In Comparison
The post compares two recent papers on database storage architecture: one on storage-disaggregated databases and one on a shared transaction log architecture. The comparison highlights the different performance implications, scalability, and design complexities of the two architectures. Storage-disaggregated databases separate storage from compute, allowing independent scaling and potentially reducing network bottlenecks. The shared log architecture focuses on high durability and fault tolerance through a State Machine Replication (SMR) group, simplifying log storage but requiring a more complex system to handle read and write operations. Both architectures have their advantages and challenges in terms of performance, scalability, and maintenance.
26
12
Article
Community Picks·1y
Kafka Architecture & Troubleshooting Quiz
Gauge your knowledge of Apache Kafka with a quiz that covers multi-datacenter deployments, event sourcing patterns, performance optimization, and troubleshooting in production. Enhance your understanding of Kafka's internals and best practices for building reliable, scalable distributed systems.
24
