Best of Kafka — 2025

1
Article
Materialized View·51w
Kafka: The End of the Beginning
Apache Kafka has dominated streaming data for over a decade, but innovation has stagnated while batch processing has evolved rapidly. The streaming ecosystem faces challenges with slow growth, long sales cycles, and lack of new ideas. While Kafka's protocol has become the de facto standard, its architecture shows limitations for modern cloud-native requirements. New solutions like S2 are emerging with fresh approaches, and the next decade could see a transition similar to how batch processing moved beyond Hadoop, potentially ushering in a truly cloud-native streaming era.
293
6
2
Article
Foojay.io·1y
Building a Real-Time AI Fraud Detection System with Spring Kafka and MongoDB
This tutorial explains the step-by-step process of building a real-time fraud detection system using Spring Kafka, MongoDB, and AI-generated embeddings. It covers setting up a MongoDB database and creating a vector search index to detect anomalies in transaction data. The guide also illustrates creating synthetic customer profiles and generating transactions to analyze historical patterns for potential fraud, along with optimizing performance strategies.
169
1
3
Article
Datadog·22w
How microservice architectures have shaped the usage of database technologies
Microservices have transformed database usage from monolithic, single-database architectures to distributed systems where organizations run multiple database technologies simultaneously. Analysis of 2.5 million services shows over half of organizations now use both SQL and NoSQL databases side by side, with many adopting 3+ different database technologies. This shift enables teams to choose the right tool for each service but introduces new challenges: fragmented schemas require data integration layers like GraphQL, analytics demands OLAP systems like Snowflake, and service communication relies heavily on message queues like Kafka and RabbitMQ for asynchronous decoupling.
156
1
4
Article
Medium·50w
How Kafka Saved Our Payment System And Helped Us Scale to 10 Million Users
A payment system was failing due to synchronous processing of multiple tasks (email, notifications, logging) in a single thread, causing delays and duplicate charges. The team implemented Kafka as a message broker to decouple services through event-driven architecture. After a payment succeeds, the system publishes a single event to Kafka, allowing independent services to consume and process it asynchronously. This approach eliminated blocking operations, improved response times, reduced support tickets, and enabled the system to scale to 10 million users while maintaining reliability and making it easier to add new features.
144
5
5
Article
Medium·47w
Why We Replaced Kafka with gRPC for Service Communication
A development team replaced Kafka with gRPC for synchronous service communication in their loan servicing platform after experiencing issues with debugging, latency, and operational complexity. While keeping Kafka for appropriate use cases like audit logs and fan-out patterns, they found gRPC provided better performance (70-80% latency reduction), easier debugging, and simpler infrastructure management for request-response interactions. The key lesson was using each tool for its intended purpose rather than forcing one solution everywhere.
135
7
6
Article
Confluent Blog·1y
Designing Event-Driven, Multi-Agent AI Architectures w/Kafka and Flink
Learn how to build an event-driven, multi-agent AI architecture to simplify meal planning. Using tools like Kafka, Flink, LangChain, and Claude, the system coordinates multiple AI agents—each with specialized tasks like planning child-friendly or adult meals—into a cohesive meal plan. The approach ensures real-time responsiveness, adaptability, and fault tolerance, making complex daily tasks more manageable.
120
7
Article
Data Engineer Things·47w
Building a Real-Time Flight Data Pipeline with Kafka, Spark, and Airflow
A comprehensive guide to building a real-time flight data pipeline using Kafka for streaming, Spark for processing, and Airflow for orchestration. The pipeline fetches live flight data from a custom API, streams it through Kafka to MongoDB for storage, then uses Airflow to schedule daily ETL jobs that extract landed flight information into PostgreSQL and generate CSV reports. The project includes Docker containerization, complete code examples, and demonstrates end-to-end data engineering practices from real-time ingestion to batch processing and reporting.
119
4
8
Article
Debezium·25w
CQRS Design Pattern
Command Query Responsibility Segregation (CQRS) separates read and write operations into distinct data models and databases, enabling independent scaling, improved security, and better performance. The pattern addresses challenges in both monolithic and microservice architectures by decoupling data access patterns. Implementation approaches range from database-native replication (like PostgreSQL streaming replication) to Change Data Capture solutions using Debezium. A practical voting application demonstrates three replication scenarios: PostgreSQL-to-PostgreSQL using native replication, PostgreSQL-to-MySQL using Debezium JDBC sink connector, and PostgreSQL-to-QuestDB using vendor-specific sink connectors. Debezium simplifies CQRS implementation in heterogeneous environments by capturing database changes in real-time and propagating them across different database technologies through Kafka Connect.
118
1
9
Article
Hacker News·32w
From Millions to Billions
Geocodio migrated their request logging system from MariaDB with deprecated TokuDB to ClickHouse Cloud after hitting performance issues at billions of monthly requests. The solution involved introducing Kafka for event streaming and Vector for batch processing, moving from individual row inserts to batched inserts of 30k-50k records. The migration strategy used feature flags to run both systems in parallel, enabling zero-downtime validation before fully switching over.
110
2
10
Article
Anubhav Bhatt·28w
Pub/Sub Model Saved Our Insurance System from Collapse
A tightly coupled insurance policy activation system was failing when any downstream service experienced an outage. By refactoring from sequential service calls to an event-driven pub/sub architecture using Kafka, the system became resilient and decoupled. Each service now independently subscribes to policy activation events, allowing failures to be isolated and new services to be added without modifying core logic.
108
12
11
Article
Towards Dev·42w
Building a Scalable Real-Time ETL Pipeline with Kafka, Debezium, Flink, Airflow, MinIO, and ClickHouse
A comprehensive guide to building a scalable real-time ETL pipeline using open-source tools including Kafka for data streaming, Debezium for change data capture, Flink for stream processing, ClickHouse as a lakehouse solution, Airflow for orchestration, and MinIO for object storage. The architecture separates hot and cold data layers, with real-time data stored locally for performance and historical data in remote storage for cost optimization. Includes practical implementation steps, Docker configurations, and dashboard creation using Apache Superset.
102
12
Video
Community Picks·48w
Learn Microservices and Kafka with an E-commerce Example | Kafka Tutorial for beginners
A comprehensive tutorial demonstrating how to transform a monolithic e-commerce application into a microservices architecture using Apache Kafka for inter-service communication. The guide covers breaking down payment, order, email, and analytics services into independent components, implementing Kafka producers and consumers, setting up Docker containers, and creating fault-tolerant Kafka clusters with multiple brokers and partitions. The tutorial includes practical code examples showing how to handle asynchronous messaging, reduce response times from 12 seconds to 3 seconds, and ensure system resilience through distributed architecture.
101
13
Article
theburningmonk.com·1y
Event versioning strategies for event-driven architectures
The post discusses various strategies for versioning event schemas in event-driven architectures, including adding versions in event names, event payloads, separate streams, and using schema registries. It evaluates the pros and cons of each method and suggests alternative approaches to avoid breaking changes. The author emphasizes the importance of supporting backward compatibility and recommends always adding new fields instead of breaking existing schemas.
100
2
14
Article
Towards AI·1y
End-to-End Data Engineering System on Real Data with Kafka, Spark, Airflow, Postgres, and Docker
The post provides a detailed guide on building an end-to-end data engineering system using Kafka for data streaming, Spark for data transformation, Airflow for orchestration, PostgreSQL for storage, and Docker for setup and deployment. It is structured into two phases: the first focuses on constructing the data pipeline, while the second will cover creating an application to interact with the database using language models. This project is particularly suited for beginners to data engineering, aiming to deepen their practical knowledge of handling data systems.
99
15
Article
Netflix TechBlog·45w
Netflix Tudum Architecture: from CQRS with Kafka to CQRS with RAW Hollow
Netflix migrated their Tudum fan site architecture from a CQRS pattern using Kafka and traditional caching to RAW Hollow, an in-memory compressed object database. The original architecture suffered from eventual consistency delays, taking minutes for content changes to appear. RAW Hollow eliminated the need for separate databases and Kafka infrastructure by storing the entire dataset in memory across application processes, reducing homepage construction time from 1.4 seconds to 0.4 seconds and enabling real-time content previews.
91
16
Article
Netflix TechBlog·34w
Building a Resilient Data Platform with Write-Ahead Log at Netflix
Netflix built a generic Write-Ahead Log (WAL) system to solve data consistency and reliability challenges at scale. The system provides a simple API that abstracts underlying message queues (Kafka, SQS) and supports multiple use cases including delayed queues, cross-region replication, and multi-partition mutations. WAL prevents data loss, handles system entropy across different datastores, and enables reliable retry mechanisms for real-time data pipelines. The architecture separates message producers from consumers, uses configurable namespaces for logical separation, and leverages Netflix's Data Gateway infrastructure for deployment. Key applications include EVCache cross-region replication, Live Origin's delayed delete operations, and Key-Value service's MutateItems API with two-phase commit semantics.
87
17
Article
ByteByteGo·37w
How Netflix Tudum Supports 20 Million Users With CQRS
Netflix redesigned their Tudum platform architecture to support 20 million users by replacing a traditional CQRS implementation with RAW Hollow, an in-memory object store. The original design used Kafka and Cassandra with caching layers, causing delays in editorial previews due to eventual consistency. By embedding RAW Hollow directly into microservices, they eliminated external datastores and reduced page construction time from 1.4 seconds to 0.4 seconds while enabling near-instant editorial previews. The compressed in-memory approach stores three years of data in just 130MB while maintaining strong consistency options for critical workflows.
87
1
18
Article
ByteByteGo·25w
How Netflix Built a Distributed Write Ahead Log For Its Data Platform
Netflix built a distributed Write-Ahead Log (WAL) system to solve data reliability issues across their platform. The WAL captures every data change before applying it to databases, enabling automatic retries, cross-region replication, and multi-partition consistency. Built on top of their Data Gateway Infrastructure, it uses Kafka and Amazon SQS as pluggable backends, supports multiple use cases through namespaces, and scales independently through sharded deployments. The system provides durability guarantees while allowing teams to configure retry logic, delays, and targets without code changes.
82
19
Article
Netflix TechBlog·31w
How and Why Netflix Built a Real-Time Distributed Graph: Part 1 — Ingesting and Processing Data Streams at Internet Scale
Netflix built a Real-Time Distributed Graph (RDG) to analyze member interactions across different business verticals like streaming, gaming, and live events. The system processes over 1 million Kafka messages per second using Apache Flink jobs that transform events into graph nodes and edges, writing more than 5 million records per second to storage. The architecture evolved from a monolithic Flink job to a 1:1 mapping between Kafka topics and Flink jobs for better operational stability and tuning. This first part covers the ingestion and processing pipeline, with future posts planned for storage and serving layers.
82
20
Article
Platformatic·26w
kafka 223% Faster (And What We Learned Along the Way)
Platformatic improved their Kafka client for Node.js by 223% after discovering their benchmark methodology was flawed. By fixing measurement issues (per-operation timing, proper delivery tracking, larger sample sizes), they identified real bottlenecks including CRC32C computation, error handling, and metadata request bugs. Key optimizations included switching to native Rust CRC32C implementation, refactoring async error handling, and fixing connection handling. The pure JavaScript implementation now achieves 92,441 op/sec for single messages and 159,828 op/sec for consumption with <2% variance, outperforming native librdkafka bindings by avoiding cross-boundary overhead while maintaining minimal buffer copying and non-blocking event loop usage.
73
21
Article
Confluent Blog·1y
How to build a multi-agent orchestrator using Flink and Kafka
The post explores the creation of multi-agent systems using an orchestrator pattern, with Apache Flink and Kafka as key technologies. It highlights the necessity of dividing complex tasks among specialized AI agents for better collaboration and problem-solving. The orchestrator facilitates efficient message routing and real-time decision-making by interpreting and distributing tasks dynamically. The combination of Flink's real-time processing and Kafka's event-driven messaging creates a scalable, adaptable system without rigid dependencies.
69
22
Article
Grafana Labs·29w
Grafana Mimir 3.0 release: performance improvements, a new query engine, and more
Grafana Mimir 3.0 introduces a redesigned architecture that separates read and write operations using Apache Kafka as an asynchronous buffer, eliminating performance bottlenecks between ingestion and queries. The release features the Mimir Query Engine (MQE), which processes queries in a streaming fashion rather than bulk loading, reducing peak memory usage by up to 92%. These improvements deliver 15% lower resource usage in large clusters while maintaining faster query execution and higher reliability. The new ingest storage component ensures query spikes won't slow down data ingestion and vice versa, enabling independent scaling of each path.
47
1
23
Article
Netflix TechBlog·31w
Behind the Streams: Real-Time Recommendations for Live Events Part 3
Netflix engineered a real-time recommendation system to handle live event streaming at massive scale, serving over 100 million concurrent devices. The solution uses a two-phase approach: prefetching recommendations and metadata during natural browsing patterns before events, then broadcasting low-cardinality state updates via WebSocket when events start. This architecture solves the thundering herd problem by distributing load over time and minimizing real-time compute requirements. The system leverages GraphQL schemas, Apache Kafka, and a two-tier pub/sub architecture to deliver updates in under a minute during peak load, while adaptive traffic prioritization and cache jitter prevent unexpected traffic spikes.
46
24
Article
Tinybird·23w
Build a Real-Time E-Commerce Analytics API from Kafka in 15 Minutes
A step-by-step guide to building a real-time e-commerce analytics API using Kafka as the data source. Covers connecting to Kafka, ingesting order events, enriching data with dimension tables and PostgreSQL, creating materialized views for pre-aggregated metrics, and exposing multiple API endpoints. The tutorial progresses from a basic 5-minute setup querying raw Kafka data to advanced features including data enrichment, automated PostgreSQL syncing, and optimized aggregations using materialized views. All implementation uses SQL and configuration without requiring application code.
40
25
Article
ByteByteGo·31w
How Pinterest Transfers Hundreds of Terabytes of Data With CDC
Pinterest built a unified Change Data Capture platform to handle thousands of database shards and millions of queries per second. The system uses Debezium and Apache Kafka with a two-layer architecture: a control plane that manages connector configurations and a data plane that streams database changes. Key challenges included out-of-memory errors from large backlogs, frequent task rebalancing causing instability, slow failover recovery taking over two hours, and duplicate tasks from a Kafka bug. Solutions involved bootstrapping from latest offsets, increasing rebalance timeouts to 10 minutes, enabling worker-level shard discovery, and upgrading to Kafka 2.8.2 version 3.6, which reduced CPU usage from 99% to 45% and stabilized the system to run 3,000 tasks reliably.
39

See all Kafka archives