Best of Apache Flink 2024

  1.
    Article
    Data Engineer Things · 1y

    Apache Flink Overview

    Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Its processing model is centered on streams, and its runtime is built from components such as the Dispatcher, JobManager, ResourceManager, and TaskManager. Flink distinguishes event time (when an event occurred) from processing time (when an operator handles it), which helps it deal with out-of-order and delayed data. It also offers state management and checkpointing for fault tolerance. The architecture supports scalable, high-throughput, low-latency processing, which suits applications that handle complex event data.
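
To make the event-time versus processing-time distinction concrete, here is a minimal sketch in plain Python. It does not use the Flink API; the window size, field names, and sample events are all hypothetical, chosen only to show how one delayed event ends up in different tumbling windows depending on which clock is used.

```python
from collections import defaultdict

WINDOW_MS = 10_000  # hypothetical 10-second tumbling window


def window_start(ts_ms: int, size_ms: int = WINDOW_MS) -> int:
    """Return the start of the tumbling window that contains ts_ms."""
    return ts_ms - (ts_ms % size_ms)


def count_per_window(events, time_field):
    """Count events per tumbling window, keyed by the chosen timestamp field."""
    counts = defaultdict(int)
    for event in events:
        counts[window_start(event[time_field])] += 1
    return dict(counts)


# Each event records when it happened (event_time) and when it arrived
# at the processor (arrival_time). Values are illustrative only.
events = [
    {"id": "a", "event_time": 1_000, "arrival_time": 1_200},
    {"id": "b", "event_time": 4_000, "arrival_time": 4_100},
    {"id": "d", "event_time": 12_000, "arrival_time": 12_050},
    {"id": "c", "event_time": 9_000, "arrival_time": 13_000},  # delayed in transit
]

by_event_time = count_per_window(events, "event_time")
by_processing_time = count_per_window(events, "arrival_time")

print(by_event_time)       # {0: 3, 10000: 1}
print(by_processing_time)  # {0: 2, 10000: 2}
```

Grouping by event time puts the delayed event "c" in the window where it actually occurred, while grouping by arrival time shifts it into the next window. A real engine also has to decide how long to wait for such stragglers, which this sketch leaves out.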

  2.
    Article
    Towards Data Science · 2y

    How I Dockerized Apache Flink, Kafka, and PostgreSQL for Real-Time Data Streaming

    Integrate Apache Flink, Kafka, and PostgreSQL with Docker Compose, using PyFlink for real-time data processing. The guide offers practical tips, configures Flink in session mode, and shows how to build custom Docker images for PyFlink so that Python jobs run reliably. It also covers setting up Kafka topics, creating Postgres tables, and handling sensor data streams. Follow the step-by-step guide to build and experiment with a streaming pipeline that processes and stores data.

  3.
    Article
    Collections · 1y

    How Airbnb Processes a Million User Events Every Second

    Airbnb's User Signals Platform processes over a million user events per second using the Lambda Architecture, which combines real-time processing with historical data accuracy. Apache Flink, a stream-processing framework, is central to achieving low latency, fault tolerance, and seamless integration, allowing Airbnb to enhance its recommendation system and drive revenue growth.

  4.
    Article
    InfoWorld · 2y

    3 data engineering trends riding Kafka, Flink, and Iceberg

    Apache Kafka, Apache Flink, and Apache Iceberg are revolutionizing data management. Kafka enables real-time data movement, Flink processes this data efficiently, and Iceberg structures stored data for query accessibility. Innovations in these open-source tools are shaping data engineering practices, particularly in microservices, AI integration, and community-driven Iceberg tools. Staying informed on these trends ensures proficiency in a rapidly evolving field.

  5.
    Article
    Airbnb · 2y

    Apache Flink® on Kubernetes

    Airbnb transitioned its stream processing architecture from Apache Hadoop YARN to Kubernetes. This migration improved developer velocity and job availability and reduced infrastructure costs. The current setup integrates Flink directly with Kubernetes, offering a better developer experience, secure secrets management, isolated environments, enhanced monitoring, and simpler service discovery. Future work will focus on improving job availability, enabling autoscaling, and using the Flink Kubernetes Operator to streamline operations.

  6.
    Article
    Community Picks · 2y

    Append-only tables and incremental reads — Jack Vanlightly

    The post discusses the support for append-only tables and incremental reads in various table formats such as Apache Iceberg, Delta Lake, Apache Hudi, and Apache Paimon. It explains how incremental reads allow compute engines to return new records or changes since the last query. Each table format supports these features differently, with Iceberg and Delta adding new data files without performing data conflict checks, whereas Hudi uses file groups and Paimon uses row-level operations. The post also touches on the performance implications and potential data conflicts with multiple writers.
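
The idea of an incremental read can be sketched with a toy append-only table in plain Python. This is not the API of Iceberg, Delta Lake, Hudi, or Paimon; the class, method names, and snapshot numbering are hypothetical and only illustrate how a reader that remembers its last snapshot can fetch just the records committed since then.

```python
from dataclasses import dataclass, field


@dataclass
class AppendOnlyTable:
    """Toy append-only table: each commit adds one snapshot of new records."""

    # snapshots[i] holds the records added by commit number i + 1
    snapshots: list = field(default_factory=list)

    def append(self, records):
        """Commit a batch of records and return the new snapshot id (1-based)."""
        self.snapshots.append(list(records))
        return len(self.snapshots)

    def read_since(self, last_snapshot_id):
        """Return (records committed after last_snapshot_id, latest snapshot id)."""
        new_records = [
            record
            for batch in self.snapshots[last_snapshot_id:]
            for record in batch
        ]
        return new_records, len(self.snapshots)


table = AppendOnlyTable()
table.append(["r1", "r2"])

# The first read starts from snapshot 0, so it sees everything committed so far.
first_read, cursor = table.read_since(0)

table.append(["r3"])
table.append(["r4", "r5"])

# The second read resumes from the saved cursor and sees only the new records.
second_read, cursor = table.read_since(cursor)

print(first_read)   # ['r1', 'r2']
print(second_read)  # ['r3', 'r4', 'r5']
```

Because nothing is ever rewritten or deleted here, the reader never needs to reconcile changes to earlier records. Concurrent writers, updates, and conflict checks, which the post covers, are outside the scope of this sketch.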