Best of Distributed Systems — January 2026

1
Article
DEV·18w
Microservices Are Killing Your Performance (And Here's the Math)
Microservices introduce significant performance overhead compared to monolithic architectures. Network calls add 1,000-5,000x latency versus in-process function calls, resulting in 50-150% higher latency, 300% more resource usage, and 2-3x infrastructure costs. Benchmarks show microservices suffer from cascading failures (5x more downtime), database connection exhaustion, and serialization overhead. The article argues microservices solve organizational problems (team autonomy, independent deployment) rather than technical ones, and recommends modular monoliths for most applications. Microservices only make sense for organizations with 50+ engineers, independent scaling requirements, technology diversity needs, or compliance isolation.
436
39
2
Article
ByteByteGo·18w
How Uber Serves over 150 Million Reads per Second from Integrated Cache
Uber's CacheFront system serves over 150 million database reads per second using Redis while maintaining strong consistency. The system uses a three-layer architecture with Query Engine, Storage Engine, and integrated caching. Initial challenges included cache invalidation delays and stale data from conditional updates. Uber solved this by implementing soft deletes, monotonic timestamps, and synchronous write-path invalidation alongside asynchronous CDC (Flux) and TTL expiration. This triple-defense strategy achieves 99.9%+ cache hit rates with near-zero stale values, even with 24-hour TTLs.
216
4
3
Article
ByteByteGo·16w
How Google Manages Trillions of Authorizations with Zanzibar
Zanzibar is Google's global authorization system that handles over 10 million permission checks per second across services like Drive, YouTube, and Maps. It uses a tuple-based data model to represent permissions, employs zookies (tokens) with Google Spanner's TrueTime for consistency guarantees, and runs on 10,000+ servers across 30+ geographic locations. The system achieves 99.999% availability through distributed caching, request deduplication, and client isolation, with 99% of checks served in 3ms median latency. Key architectural decisions include flexible relation tuples, causality-respecting consistency protocols, and optimized serving layers with intelligent caching strategies.
96
2
4
Article
ByteByteGo·17w
How Netflix Built a Real-Time Distributed Graph for Internet Scale
Netflix built a Real-Time Distributed Graph (RDG) to track member interactions across streaming, gaming, and other services. The system processes millions of events per second using Apache Kafka for ingestion, Apache Flink for stream processing, and a custom Key-Value Data Abstraction Layer (KVDAL) built on Cassandra for storage. Netflix rejected traditional graph databases like Neo4j and AWS Neptune due to scalability limitations and operational complexity, instead emulating graph capabilities using key-value storage with adjacency lists. The architecture handles 8 billion nodes, 150 billion edges, 2 million reads/second, and 6 million writes/second across 2,400 EC2 instances, with each node and edge type isolated in separate namespaces for independent scaling.
88
5
Article
ByteByteGo·17w
How Pinterest Built An Async Compute Platform for Billions of Task Executions
Pinterest rebuilt their asynchronous job processing platform from Pinlater to Pacer to handle billions of daily tasks. The original system suffered from database lock contention, lack of queue isolation, and inefficient sharding. Pacer introduced dedicated dequeue broker services managed by Helix, eliminated lock contention through single-broker partition ownership, implemented in-memory caching for sub-millisecond latency, adopted adaptive sharding based on queue size, and isolated worker pods on Kubernetes with custom resource allocations per queue.
35
6
Article
Metadata·19w
The Sauna Algorithm: Surviving Asynchrony Without a Clock
A creative analogy explains how distributed systems can coordinate without synchronized clocks by using causal relationships between events. The "sauna algorithm" demonstrates the happened-before relationship: by tracking when someone enters after you and leaving when they leave, you ensure a safe duration without measuring time. This mirrors how asynchronous distributed systems use logical clocks and causal ordering to maintain consistency. The approach highlights the importance of anchoring to events that occur after your entry point, avoiding stale state, though it introduces potential deadlock risks if universally adopted.
33
4
7
Article
Metadata·17w
Welcome to Town Al-Gasr
A satirical allegory about distributed systems and autonomous agents gone wrong. Through the fictional town of Al-Gasr, the piece critiques common anti-patterns in system design: multiple sources of truth, lack of testing disguised as confidence, political decision-making over engineering principles, and eventual consistency taken to absurd extremes. The narrative lampoons organizational dysfunction, where ministries supervise each other in circles, failures are rebranded as victories, and the system maintains three simultaneous leaders for 'high availability.' It's a cautionary tale about what happens when governance, accountability, and technical rigor collapse in autonomous systems.
29
2
8
Article
BigData Boutique blog·17w
OpenSearch Kubernetes Operator 3.0 - Stability and Resilience Finally Delivered
OpenSearch Kubernetes Operator 3.0 Alpha introduces major stability improvements including quorum-safe rolling restarts, multi-namespace support, TLS certificate hot reloading, and gRPC API support. The release addresses critical production issues like upgrade deadlocks, split-brain scenarios, and cluster instability through over 100 changes. Key features include SmartScaler enabled by default, init-container and sidecar support, NFS volumes, and OpenSearch 3.0 compatibility. The API is migrating from opensearch.opster.io to opensearch.org with automatic migration handling. Breaking changes include new security defaults and enabled validation webhooks. The alpha is recommended for testing in lower environments first, with GA release planned after beta testing.
22
9
Article
Architecture Weekly·20w
Rebuilding Event-Driven Read Models in a safe and resilient way
Event-driven read models can be rebuilt safely using a hybrid approach combining PostgreSQL advisory locks with persistent status tracking. Inline projections use shared locks for concurrent processing, while async rebuilds acquire exclusive locks to prevent conflicts. Advisory locks handle fast-path coordination and prevent races during transitions, while status columns provide crash recovery and persist across connection failures. This design enables zero-downtime rebuilds, handles multiple processor instances, and maintains consistency without external infrastructure.
16
10
Article
ByteByteGo·19w
EP197: 12 Architectural Concepts Developers Should Know
A curated collection covering essential architectural patterns and developer tools. The main list includes 12 core concepts like load balancing, caching, CDN, message queues, API gateways, circuit breakers, sharding, and auto-scaling. Additional sections cover developer tools for 2026 (IDEs, version control, CI/CD, containers), five rate limiting strategies (fixed window, sliding window, token bucket, leaky bucket), live streaming protocols and workflow, and five leader election algorithms used in distributed databases (Bully, Ring, Paxos, Raft, ZooKeeper).
11

See all Distributed Systems archives