Best of Distributed Systems — October 2025

1
Article
ByteByteGo·32w
How Flipkart Built a Highly Available MySQL Cluster for 150+ Million Users
Flipkart built Altair, an internally managed MySQL service that maintains high availability for 150+ million daily users through automated failover and primary-replica architecture. The system uses a three-layered monitoring approach (agent, monitor, orchestrator) to detect failures, prevent false positives, and execute failovers with minimal data loss. Altair prioritizes write availability over strong consistency using asynchronous replication, implements DNS-based service discovery for seamless failovers, and includes multiple safeguards against split-brain scenarios. The design balances operational simplicity with reliability, achieving near five-nines availability while managing thousands of database clusters across Flipkart's microservices infrastructure.
120
2
Article
Salesforce Engineering·29w
Architecting Multi-System Production Platform
Salesforce built Digital Wallet, a consumption-based pricing platform serving 15,000+ organizations and generating $400M+ in annual contract value. The engineering team overcame significant challenges as Data Cloud's first customer, including implementing SOX-compliant metadata security through Strict System Mode, building a custom event subscriber processing 20M daily events, and architecting failover strategies for near real-time usage tracking. The platform integrates multiple systems using fan-out mechanisms for entitlement sync, implements Spark job failover between EMR-on-EKS and EMR-on-EC2 to avoid rate limits, and maintains billing accuracy through architectural separation of hourly customer-facing updates from monthly financial reconciliation. The system includes high-cardinality monitoring, automatic retry logic, and a month-long buffer for usage reconciliation before billing.
106
3
Article
Netflix TechBlog·31w
How and Why Netflix Built a Real-Time Distributed Graph: Part 1 — Ingesting and Processing Data Streams at Internet Scale
Netflix built a Real-Time Distributed Graph (RDG) to analyze member interactions across different business verticals like streaming, gaming, and live events. The system processes over 1 million Kafka messages per second using Apache Flink jobs that transform events into graph nodes and edges, writing more than 5 million records per second to storage. The architecture evolved from a monolithic Flink job to a 1:1 mapping between Kafka topics and Flink jobs for better operational stability and tuning. This first part covers the ingestion and processing pipeline, with future posts planned for storage and serving layers.
82
4
Video
Continuous Delivery·33w
Why Are Software Engineers Quitting Microservices?
Explores the recent discourse around developers abandoning microservices, analyzing whether this trend is real and justified. Examines a widely-cited Amazon case study that moved from microservices to a monolith, questioning whether their original implementation was truly microservices-based. Discusses the inherent complexity of microservices, including distributed systems challenges, the need for sophisticated development practices like continuous delivery and contract testing, and proper service boundaries aligned with business capabilities. Argues that microservices remain the best solution for scaling development teams but come with significant overhead that makes them unsuitable for small teams, emphasizing that success requires careful design of service interfaces and organizational decentralization.
74
6
5
Article
System Design Newsletter·33w
System Design Interview: Design Spotify
A comprehensive guide to designing a music streaming platform like Spotify for system design interviews. Covers architecture components including blob storage for audio files, SQL databases for metadata, CDN integration, and API design. Explores capacity planning for 500K users and 30M songs, read/write workflows, and scaling strategies like database replication, sharding, and horizontal scaling. Includes practical considerations for audio delivery, caching, reliability patterns, and monitoring metrics.
68
2
6
Article
ByteByteGo·33w
How Airbnb Runs Distributed Databases on Kubernetes at Scale
Airbnb deployed distributed SQL databases across multiple Kubernetes clusters, each mapped to a different AWS Availability Zone, to achieve high availability and fault tolerance. They built custom Kubernetes operators to safely manage stateful workloads, coordinate node replacements, and maintain quorum during failures. Using AWS EBS for persistent storage, PVCs for volume management, and techniques like replica reads and stale reads, they mitigated latency issues while maintaining consistency. Their largest production cluster handles 3 million queries per second across 150 nodes with 300TB of data, achieving 99.95% availability through careful sequencing of upgrades, canary deployments, and overprovisioning for resilience.
67
1
7
Article
System Design Codex·30w
How Paypal Built JunoDB?
PayPal open-sourced JunoDB, a distributed key-value store handling 350 billion daily requests with 99.9999% availability. Built in Go for multi-core support and CPU-intensive operations like encryption, JunoDB uses a proxy-based architecture with RocksDB for storage. Unlike Redis's single-threaded, memory-bound design, JunoDB is optimized for CPU-bound workloads. Common use cases include caching with flexible TTLs, implementing idempotency for payment processing, tracking counters for rate limiting, and bridging replication latency between data centers in active-active database configurations.
65
8
Article
Netflix TechBlog·30w
Behind the Streams: Real-Time Recommendations for Live Events Part 3
Netflix engineered a real-time recommendation system to handle live event streaming at massive scale, serving over 100 million concurrent devices. The solution uses a two-phase approach: prefetching recommendations and metadata during natural browsing patterns before events, then broadcasting low-cardinality state updates via WebSocket when events start. This architecture solves the thundering herd problem by distributing load over time and minimizing real-time compute requirements. The system leverages GraphQL schemas, Apache Kafka, and a two-tier pub/sub architecture to deliver updates in under a minute during peak load, while adaptive traffic prioritization and cache jitter prevent unexpected traffic spikes.
46
9
Article
ClickHouse·30w
How Netflix optimized its petabyte-scale logging system with ClickHouse
Netflix processes 5 petabytes of logs daily using ClickHouse, handling 10.6 million events per second with sub-second query performance. Three key optimizations enabled this scale: replacing regex-based log fingerprinting with generated lexers (8-10x faster), implementing custom native protocol serialization for efficient data ingestion, and sharding tag maps to reduce query times from 3 seconds to 700ms. The system combines ClickHouse for hot data with Apache Iceberg for long-term storage, making logs searchable within 20 seconds while serving 500-1,000 queries per second across 40,000+ microservices.
40
1
10
Article
ByteByteGo·30w
How Pinterest Transfers Hundreds of Terabytes of Data With CDC
Pinterest built a unified Change Data Capture platform to handle thousands of database shards and millions of queries per second. The system uses Debezium and Apache Kafka with a two-layer architecture: a control plane that manages connector configurations and a data plane that streams database changes. Key challenges included out-of-memory errors from large backlogs, frequent task rebalancing causing instability, slow failover recovery taking over two hours, and duplicate tasks from a Kafka bug. Solutions involved bootstrapping from latest offsets, increasing rebalance timeouts to 10 minutes, enabling worker-level shard discovery, and upgrading to Kafka 2.8.2 version 3.6, which reduced CPU usage from 99% to 45% and stabilized the system to run 3,000 tasks reliably.
39
11
Article
Hacker News·33w
Why TigerBeetle is the most interesting database in the world
TigerBeetle is a financial transactions database built from scratch with modern distributed systems principles. It uses debits and credits as first-class primitives instead of SQL, achieving massive performance gains by packing 8,190 transactions into a single query. The database is written in Zig, distributed by default with Viewstamped Replication consensus, and handles storage faults through Protocol-Aware Recovery. TigerBeetle's development leveraged Deterministic Simulation Testing (DST) via their VOPR cluster running on 1,000 CPU cores, enabling them to build a Jepsen-validated system in just 3.5 years. Their engineering methodology, called TigerStyle, emphasizes assertions, static memory allocation, zero dependencies, and thinking deeply about performance during the design phase.
32
12
Article
InfoQ·29w
From Outages to Order: Netflix’s Approach to Database Resilience with WAL
Netflix implemented a Write-Ahead Log (WAL) system to enhance data platform resilience by capturing database mutations in a durable log before applying them downstream. The modular architecture decouples producers from consumers, uses SQS and Kafka with dead-letter queues, and supports delay queues, cross-region replication, and multi-table atomic mutations. The system addresses data loss, replication entropy, multi-partition failures, and corruption while maintaining consistency and recoverability during outages. Similar patterns are emerging industry-wide, with DoorDash presenting their Write-Ahead Intent Log for efficient Change Data Capture at scale.
31
13
Article
Fly.io·30w
Corrosion
Corrosion is an open-source distributed service discovery system built with Rust, SQLite, and CRDTs that uses gossip protocols instead of distributed consensus. Developed by Fly.io to solve global state synchronization across their platform, it propagates SQLite databases using SWIM-based gossip and cr-sqlite for conflict resolution. The article details major outages caused by the system, including a deadlock bug that locked up their entire proxy fleet, and describes their iterative improvements: watchdogs for event-loop stalls, extensive testing with Antithesis, eliminating partial updates, and regionalizing clusters to reduce blast radius.
26
2
14
Article
Architecture Weekly·32w
On Messaging and Distributed Systems with Ian Cooper
Interview with Ian Cooper exploring messaging patterns and distributed systems design. Covers why messaging remains complex despite maturity, strategies for defining system boundaries using data inside vs outside perspectives, testing approaches for distributed architectures, and practical insights from Cooper's experience teaching and implementing event-driven systems.
25
15
Article
System Design Newsletter·29w
How Remote Procedure Call Works
Remote Procedure Call (RPC) enables services to communicate efficiently by making remote function calls feel like local ones. The protocol uses client stubs and server skeletons to handle marshaling, network transmission, and unmarshaling automatically. Key failure handling strategies include timeouts, retries with idempotency safeguards, circuit breakers, and deadline propagation. Real-world implementations require service discovery, backward-compatible API evolution, streaming support, and standardized error codes. While RPC offers high performance and simple programming models through frameworks like gRPC, it creates tight coupling and requires specialized tooling compared to REST APIs.
23
16
Article
Three Dots Labs·31w
Durable Background Execution with Go and SQLite
Durable execution guarantees correct program output despite system failures. This guide demonstrates implementing durable execution in Go using SQLite and Watermill by persisting input events immediately after validation, ensuring idempotent event handlers through database constraints and deduplication, and proving atomicity with chaos engineering tests. The approach uses SQLite's persistent storage as a drop-in replacement for ephemeral backends, combined with the outbox pattern and single-transaction handlers to achieve reliable recovery from network outages, service failures, and other interruptions.
19
17
Article
Javarevisited·29w
Why Even Senior Engineers Struggle to Pass the System Design Interview?
Senior engineers often fail system design interviews despite real-world experience because interview success requires structured thinking, clear communication, and trade-off reasoning under time pressure—skills distinct from day-to-day work. Success comes from mastering frameworks that break problems into components (requirements, APIs, data modeling, scalability), practicing with realistic scenarios, and learning to articulate design decisions. Key preparation strategies include using structured learning platforms, practicing mock interviews, understanding fundamentals like caching and consistency models, and internalizing trade-offs rather than memorizing patterns.
18
18
Article
Tech World With Milan·29w
How Google, Amazon, and CrowdStrike broke millions of systems
Deep dive into three major 2025 cloud outages: AWS's DNS race condition that cascaded through 113 services for 15 hours, Google Cloud's null pointer exception in Service Control that crashed 50+ services globally for 7 hours, and CrowdStrike's kernel driver bug that locked 8.5 million Windows machines in boot loops. Each incident reveals critical lessons about race conditions, dependency chains, deployment strategies, and the fragility of centralized control planes at hyperscale. Includes technical root cause analysis, cascading failure patterns, and actionable takeaways for building resilient distributed systems.
18
19
Article
The Polymathic Engineer·30w
Caching in Distributed Systems - Part III
Explores production challenges when running distributed caches, including cache avalanche (simultaneous expiration causing database overload), thundering herd (multiple clients rebuilding same cache entry), cache penetration (requests for non-existent data), hot key problems (uneven traffic distribution), large key issues (oversized cache entries), cold start scenarios (empty cache after restart), and consistency challenges between cache and database. Provides practical solutions like staggered expiration, request coalescing, bloom filters, cache warming, and TTL-based safety nets.
18

See all Distributed Systems archives