Best of Kafka — October 2025

1
Article
Hacker News·32w
From Millions to Billions
Geocodio migrated their request logging system from MariaDB with deprecated TokuDB to ClickHouse Cloud after hitting performance issues at billions of monthly requests. The solution involved introducing Kafka for event streaming and Vector for batch processing, moving from individual row inserts to batched inserts of 30k-50k records. The migration strategy used feature flags to run both systems in parallel, enabling zero-downtime validation before fully switching over.
110
2
2
Article
Netflix TechBlog·31w
How and Why Netflix Built a Real-Time Distributed Graph: Part 1 — Ingesting and Processing Data Streams at Internet Scale
Netflix built a Real-Time Distributed Graph (RDG) to analyze member interactions across different business verticals like streaming, gaming, and live events. The system processes over 1 million Kafka messages per second using Apache Flink jobs that transform events into graph nodes and edges, writing more than 5 million records per second to storage. The architecture evolved from a monolithic Flink job to a 1:1 mapping between Kafka topics and Flink jobs for better operational stability and tuning. This first part covers the ingestion and processing pipeline, with future posts planned for storage and serving layers.
82
3
Article
Netflix TechBlog·31w
Behind the Streams: Real-Time Recommendations for Live Events Part 3
Netflix engineered a real-time recommendation system to handle live event streaming at massive scale, serving over 100 million concurrent devices. The solution uses a two-phase approach: prefetching recommendations and metadata during natural browsing patterns before events, then broadcasting low-cardinality state updates via WebSocket when events start. This architecture solves the thundering herd problem by distributing load over time and minimizing real-time compute requirements. The system leverages GraphQL schemas, Apache Kafka, and a two-tier pub/sub architecture to deliver updates in under a minute during peak load, while adaptive traffic prioritization and cache jitter prevent unexpected traffic spikes.
46
4
Article
ByteByteGo·31w
How Pinterest Transfers Hundreds of Terabytes of Data With CDC
Pinterest built a unified Change Data Capture platform to handle thousands of database shards and millions of queries per second. The system uses Debezium and Apache Kafka with a two-layer architecture: a control plane that manages connector configurations and a data plane that streams database changes. Key challenges included out-of-memory errors from large backlogs, frequent task rebalancing causing instability, slow failover recovery taking over two hours, and duplicate tasks from a Kafka bug. Solutions involved bootstrapping from latest offsets, increasing rebalance timeouts to 10 minutes, enabling worker-level shard discovery, and upgrading to Kafka 2.8.2 version 3.6, which reduced CPU usage from 99% to 45% and stabilized the system to run 3,000 tasks reliably.
39
5
Article
InfoQ·29w
From Outages to Order: Netflix’s Approach to Database Resilience with WAL
Netflix implemented a Write-Ahead Log (WAL) system to enhance data platform resilience by capturing database mutations in a durable log before applying them downstream. The modular architecture decouples producers from consumers, uses SQS and Kafka with dead-letter queues, and supports delay queues, cross-region replication, and multi-table atomic mutations. The system addresses data loss, replication entropy, multi-partition failures, and corruption while maintaining consistency and recoverability during outages. Similar patterns are emerging industry-wide, with DoorDash presenting their Write-Ahead Intent Log for efficient Change Data Capture at scale.
31
6
Article
Tinybird·31w
Flink is a 95% problem
Apache Flink is marketed as essential for real-time data processing, but it's overkill for 95% of use cases. Most real-time problems can be solved with simpler solutions: HTTP services with Postgres (65%), OLAP databases like ClickHouse (25%), or custom solutions (5%). Only about 5% of companies actually need Flink's complexity. The platform introduces massive operational overhead including new APIs to learn, additional infrastructure (Kafka, ZooKeeper/K8s), 700+ configuration parameters, complex observability requirements, and JVM dependency. Even Flink's creators acknowledge its limitations, and recent acquisitions of Flink-based companies suggest limited market traction. For most organizations under 100 developers, simpler alternatives like ClickHouse with SQL or native programming language Kafka consumers provide better cost-benefit tradeoffs without the engineering complexity.
27
1
7
Article
ClickHouse·33w
Inside Laravel Nightwatch’s Observability Pipeline: Real-Time Event Processing with Amazon MSK and ClickHouse Cloud
Laravel Nightwatch processes over 1 billion observability events daily using Amazon MSK and ClickHouse Cloud. The platform combines MSK Express brokers for event ingestion, ClickPipes for streaming data to ClickHouse, and AWS Lambda for validation. This architecture achieves sub-second query latency while handling millions of events per second. On launch day, the system processed 500 million events with 97ms average dashboard latency for 5,300 users. The dual-database design separates transactional workloads on RDS PostgreSQL from analytical workloads on ClickHouse Cloud, enabling horizontal scaling and cost-effective real-time monitoring at global scale.
13
1

See all Kafka archives