Best of Data Streaming — 2024

1
Article
Quastor Daily·2y
How Canva Collects 25 Billion Events Per Day
Canva processes over 25 billion events daily using AWS Kinesis, which gives them real-time data analysis at lower cost. Their data pipeline batches, compresses, and enriches events before routing them to Snowflake for further analysis. Switching from AWS SQS to Kinesis cut their costs by 85%.
73
1
2
Article
ByteByteGo·2y
How PayPal Scaled Kafka to 1.3 Trillion Daily Messages
PayPal scaled Kafka to handle an enormous volume of 1.3 trillion messages per day. They use Kafka for various use cases, such as tracking, database synchronization, and risk detection. PayPal implemented improvements in cluster management to reduce operational overhead.
57
1
3
Article
DEV·2y
Introducing AutoMQ: a cloud-native replacement of Apache Kafka
AutoMQ is a cloud-native replacement for Apache Kafka, designed to address the evolving needs of modern data architectures with a focus on efficiency, scalability, and cost-effectiveness. Originating from a team of open-source pioneers, it offers a unique architecture that decouples storage and computation, leveraging cloud storage to provide significant cost savings and operational efficiency. AutoMQ maintains full compatibility with Kafka, supports multi-cloud environments, and aims to integrate stream data into data lakes to enhance data access and break down silos. The growing community and successful funding highlight its potential impact on the stream storage industry.
55
2
4
Article
ByteByteGo·1y
How LinkedIn Customizes Its 7 Trillion Message Kafka Ecosystem
LinkedIn utilizes Apache Kafka to handle over 7 trillion messages daily, managing this massive scale with over 100 Kafka clusters and more than 4,000 servers. Its Kafka infrastructure includes custom features and enhancements for scalability and operability, tailored through specialized LinkedIn Kafka release branches. LinkedIn maintains unique patches and contributions to the open-source project, ensuring optimal performance and resource utilization for their specific needs, while also sharing improvements with the community.
43
1
5
Article
Towards Data Science·2y
How I Dockerized Apache Flink, Kafka, and PostgreSQL for Real-Time Data Streaming
Integrate Apache Flink, Kafka, and PostgreSQL using Docker Compose, with PyFlink handling the real-time data processing. This guide provides practical tips, configures Flink in session mode, and demonstrates how to create custom Docker images for PyFlink so that Python jobs run smoothly. The post also covers setting up Kafka topics, creating Postgres tables, and handling sensor data streams. Follow the step-by-step guide to build and experiment with a streaming pipeline that processes and stores data.
33
6
Video
Community Picks·2y
Top Kafka Use Cases You Should Know
Explore the top five use cases of Apache Kafka, starting from log analysis to real-time machine learning pipelines, system monitoring and alerting, change data capture (CDC), and system migration. Kafka excels at ingesting and processing high-volume data from multiple sources with low latency, making it invaluable in modern software architecture. Key integrations include the ELK stack for log analysis and Apache Flink and Spark for stream processing.
32
7
Article
The New Stack·2y
Kafka 3.8 Brings Faster Startups to Java Developers
Kafka 3.8, now packaged with GraalVM, promises faster startups and streamlined testing for Java developers. This update improves control over compression schemes, enhancing performance by up to 156%, and introduces support for tiered storage. The Consumer Rebalance Protocol has also been optimized to reduce computational overhead on consumers. Confluent, a major contributor, continues to support Kafka with enterprise and cloud-based services.
21
8
Article
Trendyol Tech·2y
Ensuring Client Continuity in Kafka: Handling Broker Restarts with No Disruptions
Trendyol's Data Streaming team addresses challenges in maintaining uninterrupted Kafka services by leveraging Confluent Stretch Kafka across multiple data centers. The team ensures high availability and fault tolerance by configuring replication factors and monitoring topic configurations. By implementing custom alert mechanisms and offering different topic creation options, they reduce downtime and errors during broker restarts, ensuring client applications remain unaffected.
15
9
Article
Towards Dev·2y
Transmitting Large Kafka Payloads: Best Practices and Strategies
Transmitting large payloads in Apache Kafka can be challenging due to its default 1 MB message size limit. To handle larger messages efficiently, you can increase the message size limits, use compression codecs like LZ4, optimize batching with settings such as `linger.ms` and `batch.size`, split messages into smaller chunks, or offload large data to external stores while using Kafka for metadata. These strategies help maintain high throughput and low latency without straining Kafka's resources.
15
3
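One of the strategies named in the large-payload summary above, splitting a message into smaller chunks, can be sketched in a few lines. This is a minimal illustration, not the article's implementation: the `Chunk` type, `split_payload`, and `reassemble` are hypothetical names, and no Kafka client is involved. In practice each chunk would be sent as its own record, typically keyed by the message id so all chunks land on the same partition.

```python
import uuid
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Chunk:
    """One piece of a larger payload (hypothetical wire format)."""
    message_id: str  # shared by all chunks of one logical message
    index: int       # position of this chunk, starting at 0
    total: int       # number of chunks in the logical message
    data: bytes


def split_payload(payload: bytes, chunk_size: int,
                  message_id: Optional[str] = None) -> List[Chunk]:
    """Split payload into chunks of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if message_id is None:
        message_id = uuid.uuid4().hex
    # Ceiling division; an empty payload still produces one empty chunk.
    total = max(1, -(-len(payload) // chunk_size))
    return [
        Chunk(message_id, i, total, payload[i * chunk_size:(i + 1) * chunk_size])
        for i in range(total)
    ]


def reassemble(chunks: Iterable[Chunk]) -> bytes:
    """Rebuild the payload; chunks may arrive in any order."""
    ordered = sorted(chunks, key=lambda c: c.index)
    if not ordered:
        raise ValueError("no chunks to reassemble")
    if len({c.message_id for c in ordered}) != 1:
        raise ValueError("chunks belong to different messages")
    total = ordered[0].total
    if [c.index for c in ordered] != list(range(total)):
        raise ValueError("missing or duplicate chunks")
    return b"".join(c.data for c in ordered)
```

The chunk size should be chosen to stay under the configured message size limit after headers and serialization overhead are added.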
10
Article
The New Stack·2y
Top 10 Tools for Kafka Engineers
Explore the top 10 tools used by Kafka engineers to build and maintain efficient Kafka ecosystems. Learn about kcat for real-time monitoring, Debezium for change data capture, Kafka Streams for building stateful applications, Grafana for visualizations, Kafka UIs for easier cluster management, Redpanda as a Kafka-compatible platform, Cruise Control for cluster optimization, Kafka Security Manager for ACL management, MirrorMaker for data replication, and Kafka Proxy for enhanced security.
12
11
Article
Collections·2y
How LinkedIn Scaled Their System to 5 Million Queries Per Second
LinkedIn scaled their Restrictions and Enforcement System to handle 5 million queries per second by using advanced techniques such as BitSets, Bloom Filters, and full refresh-ahead caching strategies. The architecture includes components like the Venice Database and Kafka for real-time data streaming, ensuring high availability, low latency, and efficient memory usage.
11
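For readers unfamiliar with the Bloom filters mentioned in the LinkedIn summary above, here is a minimal sketch of the data structure. It is a generic textbook version and says nothing about how LinkedIn's system is built; the class name, the sizing formulas' parameters (`capacity`, `error_rate`), and the use of SHA-256 with double hashing are choices made for this illustration.

```python
import hashlib
import math


class BloomFilter:
    """Space-efficient set membership with false positives but no false negatives."""

    def __init__(self, capacity: int, error_rate: float = 0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.size = max(1, math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.size / capacity * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```

A lookup that returns False is definitive, so a service can skip the slower authoritative store for most negative checks and only consult it when the filter answers True.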
12
Article
Confluent Blog·2y
Inside the Kafka Black Box—How Producers Prepare Event Data for Brokers
Apache Kafka is a robust distributed event streaming platform ideal for real-time data handling. This detailed guide explores the inner workings of Kafka, focusing on Kafka producers, consumers, and brokers. Key insights include the path data takes from producer to broker, essential configurations, partitioning strategies, batching techniques, and performance metrics to monitor. The aim is to equip developers with the knowledge needed to debug and optimize their Kafka applications.
11
13
Article
Hacker News·2y
Computer Scientists Invent an Efficient New Way to Count
Computer scientists have invented a simple and efficient algorithm to approximate the number of distinct entries in a long list. Named the CVM algorithm, it uses randomization to estimate the number of unique elements. The technique's accuracy scales with the size of the memory, making it a promising solution for the distinct elements problem.
11
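The CVM algorithm described in the summary above is short enough to sketch directly. The version below follows the published pseudocode as best understood here; the function name `cvm_estimate` and the `thresh` and `rng` parameters are naming choices for this sketch, with `thresh` being the memory budget (maximum number of items held at once).

```python
import random
from typing import Hashable, Iterable, Optional


def cvm_estimate(stream: Iterable[Hashable], thresh: int,
                 rng: Optional[random.Random] = None) -> float:
    """Estimate the number of distinct items in stream using at most thresh stored items."""
    if thresh < 1:
        raise ValueError("thresh must be at least 1")
    if rng is None:
        rng = random.Random()
    p = 1.0          # current sampling probability
    sample = set()   # items currently retained
    for item in stream:
        # Forget any earlier decision about this item, then re-decide with probability p.
        sample.discard(item)
        if rng.random() < p:
            sample.add(item)
        if len(sample) >= thresh:
            # Memory is full: keep each retained item with probability 1/2 and halve p.
            sample = {x for x in sample if rng.random() < 0.5}
            p /= 2.0
            if len(sample) >= thresh:
                # Extremely unlikely for reasonable thresh; the algorithm gives up here.
                raise RuntimeError("sample did not shrink; retry with a larger thresh")
    # Each retained item stands in for 1/p distinct items.
    return len(sample) / p
```

When the stream has fewer distinct items than `thresh`, `p` never drops below 1 and the count is exact. A larger `thresh` uses more memory and gives a tighter estimate, which matches the summary's point that accuracy scales with memory.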
14
Article
Data Engineer Things·2y
I spent 8 hours researching WarpStream
WarpStream, a novel messaging system introduced in 2023, aims to address the challenges of managing Kafka infrastructure, particularly in cloud environments. It operates with a Bring Your Own Cloud (BYOC) model, using stateless agents and object storage to reduce operational overhead and costs. WarpStream's architecture separates the data and control planes, ensuring data privacy and efficient, scalable performance. Despite some latency trade-offs and current limitations in feature support compared to Kafka, WarpStream provides a cost-effective streaming solution for many use cases.
10
15
Article
Community Picks·2y
Append-only tables and incremental reads — Jack Vanlightly
The post discusses the support for append-only tables and incremental reads in various table formats such as Apache Iceberg, Delta Lake, Apache Hudi, and Apache Paimon. It explains how incremental reads allow compute engines to return new records or changes since the last query. Each table format supports these features differently, with Iceberg and Delta adding new data files without performing data conflict checks, whereas Hudi uses file groups and Paimon uses row-level operations. The post also touches on the performance implications and potential data conflicts with multiple writers.
10
1