Best of Kafka — February 2026

1
Article
ByteByteGo·15w
How LinkedIn Built a Next-Gen Service Discovery for 1000s of Services
LinkedIn replaced its decade-old ZooKeeper-based service discovery system with a next-generation architecture that uses Kafka for writes and gRPC/xDS for reads. The new system handles hundreds of thousands of service instances with 10x better median latency (P50 under 1 s versus 10 s) and 6x better P99 latency. Key improvements include horizontal scalability through Go-based Observer components, eventual consistency in place of strong consistency, multi-language support via the xDS protocol, and cross-fabric capabilities. The migration used a dual-mode strategy in which applications ran both systems simultaneously, with automated dependency analysis to transition thousands of services safely and without downtime.
2
Article
Trendyol Tech·13w
Debugging a Go Memory Leak: From OOM to Stable with pprof
A real-world walkthrough of diagnosing and fixing a Go memory leak that caused OOM crashes in a Kafka consumer service. It covers Go memory fundamentals (stack vs. heap, the tricolor GC algorithm, TCMalloc-based arenas), then details the fixes. The first is using Uber's automaxprocs to set GOMAXPROCS correctly in Kubernetes containers. The second is using pprof heap profiling, which identified two root causes: repeated time.LoadLocation disk reads, fixed with sync.Once, and a memory leak in the Confluent Kafka Go library, resolved by switching to the Segmentio kafka-go package. Together these reduced memory from hundreds of MB to a stable 25 MB.
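The sync.Once fix mentioned in this summary can be sketched roughly as follows. This is a minimal illustration, not the article's code: the names `cachedLocation`, `zoneName`, and `loads` are hypothetical, and "UTC" is used only so the sketch runs on machines without a time zone database.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// zoneName is an assumption for illustration; a real service would use
// whatever zone it actually needs.
const zoneName = "UTC"

var (
	locOnce sync.Once
	loc     *time.Location
	locErr  error
	loads   int // counts how often the load really ran (demonstration only)
)

// cachedLocation calls time.LoadLocation at most once and returns the
// cached result on every later call, instead of reloading per message.
func cachedLocation() (*time.Location, error) {
	locOnce.Do(func() {
		loads++
		loc, locErr = time.LoadLocation(zoneName)
	})
	return loc, locErr
}

func main() {
	for i := 0; i < 3; i++ {
		l, err := cachedLocation()
		if err != nil {
			fmt.Println("load failed:", err)
			return
		}
		fmt.Println(time.Unix(0, 0).In(l).Year())
	}
	fmt.Println("loads:", loads)
}
```

Calling `cachedLocation` from a hot path, such as a per-message handler, then costs only a cheap check after the first call.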
3
Article
Confluent Blog·13w
Apache Kafka 4.2.0 Released: Share Groups, Streams & More
Apache Kafka 4.2.0 is now available, bringing several major improvements. Share Groups (Kafka Queues) are now production-ready, featuring a new RENEW acknowledgement type for extended processing, adaptive batching for share coordinators, configurable fetch record limits, and comprehensive lag metrics. Kafka Streams gains GA status for its server-side rebalance protocol, dead letter queue support in exception handlers, anchored wall-clock punctuation for deterministic scheduling, and explicit control over leave-group behavior on shutdown. Observability is improved with standardized CLI arguments, corrected metric naming following the kafka.COMPONENT convention, and new idle ratio metrics for controllers and MetadataLoader. Security enhancements include an allowlist connector client configuration override policy and thread-safety fixes to RecordHeader. Additional changes cover external schema support in JsonConverter, dynamic remote log manager thread pool configuration, adaptive batching in group coordinators, and rack ID exposure in the Admin API.
4
Article
Collections·13w
Uforwarder: Uber's Scalable Kafka Consumer Proxy for Efficient Event-Driven Microservices
Uber has open-sourced uForwarder, a Kafka consumer proxy built to handle trillions of messages and petabytes of data daily across thousands of downstream services. It replaces direct Kafka consumer clients with a gRPC-based push interface that centralizes offset management. Key features include context-aware routing via Kafka message headers for workload isolation, an out-of-order commit tracker with dead letter queue support to prevent head-of-line blocking, auto-rebalancing based on real-time CPU/memory/throughput metrics, and a DelayProcessManager for partition-level pause/resume control. The result is improved hardware utilization, reduced consumer lag, and better workload isolation at massive scale.
5
Article
BigData Boutique blog·13w
Kafka MirrorMaker 2: Deployment, Gotchas, and Disaster Recovery Failback Playbook
A practical guide to deploying Kafka MirrorMaker 2 (MM2) for cluster replication, covering deployment topology, connector configuration, and production gotchas. Key decisions include deploying MM2 alongside the target cluster, choosing between DefaultReplicationPolicy and IdentityReplicationPolicy before going live, and tuning client parameters for high-throughput workloads. Common pitfalls include config drift (sync.topic.configs.enabled is unreliable) and topic recreation on the source, which scrambles offsets and requires a manual connector reset. The post closes with a detailed failback playbook: validate primary cluster health, establish temporary reverse replication from DR to primary, move consumers before producers, drain replication lag to zero before producer cutover, and always rehearse the procedure in non-production first.
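The replication policy choice matters because it decides what replicated topics are called on the target cluster. The sketch below models only that naming rule: DefaultReplicationPolicy prefixes the topic with the source cluster alias and a "." separator by default, while IdentityReplicationPolicy keeps the original name. The Go names (`policy`, `remoteTopic`) and the aliases "primary" and "dr" are hypothetical and are not part of MM2.

```go
package main

import "fmt"

// policy is a hypothetical stand-in for the two MM2 replication policies.
type policy int

const (
	defaultPolicy  policy = iota // models DefaultReplicationPolicy
	identityPolicy               // models IdentityReplicationPolicy
)

// remoteTopic returns the topic name as it would appear on the target
// cluster, assuming the default "." separator.
func remoteTopic(p policy, sourceAlias, topic string) string {
	if p == identityPolicy {
		return topic
	}
	return sourceAlias + "." + topic
}

func main() {
	// Forward replication from the primary cluster to DR.
	fmt.Println(remoteTopic(defaultPolicy, "primary", "orders"))  // primary.orders
	fmt.Println(remoteTopic(identityPolicy, "primary", "orders")) // orders

	// Temporary reverse replication during failback, from DR to primary.
	fmt.Println(remoteTopic(defaultPolicy, "dr", "orders")) // dr.orders
}
```

Under the default policy, consumers on the target cluster have to subscribe to the prefixed names, which is one reason the choice is hard to change after going live.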
