Best of Apache Iceberg 2024

  1.
    Article
    Data Engineer Things · 1y

    I spent 4 hours learning how Netflix operates Apache Iceberg at scale.

    Netflix has developed a sophisticated data platform to handle extensive data pipelines and analytics, using Apache Iceberg to overcome the limitations of their previous Hive-based system. Key components include Polaris, a custom metastore for Iceberg, and Janitors, a cleanup service. They also implemented Autotune for optimizing data layout and Autolift for localizing data files. Moreover, secure access controls were established for Iceberg tables. Netflix's migration tool for transitioning from Hive to Iceberg minimizes data movement and business interruptions.

  2.
    Article
    Data Engineer Things · 1y

    How does Netflix ensure the data quality for thousands of Apache Iceberg tables?

    Netflix employs the Write-Audit-Publish (WAP) pattern using Apache Iceberg to maintain high data quality across thousands of tables. The WAP pattern involves writing data to a hidden snapshot, auditing it, and publishing it only if it passes quality checks. This approach is analogous to CI/CD workflows, ensuring validated data is exposed to downstream consumers. Apache Iceberg's structure, including manifest files, metadata files, and catalog, supports efficient snapshot management and branching, similar to version control in Git.
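
The write, audit, publish steps described above can be sketched with a small in-memory model. This is a toy illustration only: `ToyTable`, `write_audit_publish`, and `no_null_ids` are hypothetical names, and nothing here is the Iceberg API or Netflix's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]


@dataclass
class ToyTable:
    """In-memory stand-in for a snapshot-based table (hypothetical, not Iceberg)."""
    snapshots: Dict[int, List[Row]] = field(default_factory=dict)
    current_id: Optional[int] = None  # the snapshot that readers see
    next_id: int = 1

    def stage(self, new_rows: List[Row]) -> int:
        """Write: create a snapshot that is not yet visible to readers."""
        base = self.snapshots.get(self.current_id, [])
        snap_id = self.next_id
        self.next_id += 1
        self.snapshots[snap_id] = base + list(new_rows)
        return snap_id

    def publish(self, snap_id: int) -> None:
        """Publish: move the reader-visible pointer to the staged snapshot."""
        self.current_id = snap_id

    def read(self) -> List[Row]:
        """Readers only ever see the published snapshot."""
        return list(self.snapshots.get(self.current_id, []))


def write_audit_publish(
    table: ToyTable,
    rows: List[Row],
    audits: List[Callable[[List[Row]], bool]],
) -> bool:
    """Stage rows, run every audit on the staged snapshot, publish only on success."""
    staged = table.stage(rows)
    if all(audit(table.snapshots[staged]) for audit in audits):
        table.publish(staged)
        return True
    return False  # staged snapshot stays hidden from readers


def no_null_ids(rows: List[Row]) -> bool:
    """Example audit: every row must carry a non-null id."""
    return all(row.get("id") is not None for row in rows)
```

A batch that fails an audit leaves the published snapshot untouched, which is the property that keeps unvalidated data away from downstream consumers.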

  3.
    Article
    InfoWorld · 1y

    3 data engineering trends riding Kafka, Flink, and Iceberg

    Apache Kafka, Apache Flink, and Apache Iceberg are reshaping data management. Kafka enables real-time data movement, Flink processes this data efficiently, and Iceberg structures stored data for query accessibility. Innovations in these open-source tools are shaping data engineering practices, particularly in microservices, AI integration, and community-driven Iceberg tools. Following these trends helps practitioners keep up with a rapidly evolving field.

  4.
    Article
    Data Engineer Things · 2y

    I spent 7 hours diving deep into Apache Iceberg

    This post delves into the internals of the Apache Iceberg table format, covering its data and metadata layers, manifest files, and how it manages read and write operations. It includes details on compaction, hidden partitioning, sorting, and row-level updates with both copy-on-write and merge-on-read modes. The goal is to offer a comprehensive understanding of Iceberg's capabilities and optimizations for managing large datasets efficiently.

  5.
    Article
    Community Picks · 2y

    Append-only tables and incremental reads — Jack Vanlightly

    The post discusses the support for append-only tables and incremental reads in various table formats such as Apache Iceberg, Delta Lake, Apache Hudi, and Apache Paimon. It explains how incremental reads allow compute engines to return new records or changes since the last query. Each table format supports these features differently, with Iceberg and Delta adding new data files without performing data conflict checks, whereas Hudi uses file groups and Paimon uses row-level operations. The post also touches on the performance implications and potential data conflicts with multiple writers.
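
The idea of an incremental read, returning only what was added since the reader's last query, can be sketched with a toy append-only log. `AppendOnlyLog` and its methods are hypothetical names for illustration and do not correspond to the API of Iceberg, Delta Lake, Hudi, or Paimon.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AppendOnlyLog:
    """Toy append-only table: each commit adds a batch of records (hypothetical)."""
    commits: List[List[str]] = field(default_factory=list)

    def append(self, records: List[str]) -> int:
        """Add a new batch and return the resulting snapshot position."""
        self.commits.append(list(records))
        return len(self.commits)

    def read_since(self, last_seen: int) -> Tuple[List[str], int]:
        """Return records committed after `last_seen`, plus the new position."""
        new_records = [r for commit in self.commits[last_seen:] for r in commit]
        return new_records, len(self.commits)
```

A reader keeps the returned position and passes it to the next `read_since` call, so each query sees only the batches committed in between.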