Why Apache Iceberg Is Not “Just Another Table Format”

Apache Iceberg is a table format specification that sits between file formats (Parquet, ORC) and compute engines, solving critical failures of the Hive era: slow query planning via directory listing, silent schema corruption from position-based column tracking, and lack of safe concurrent writes. Iceberg replaces directory-based tracking with a versioned metadata tree, enabling file-level statistics for query pruning, ACID transactions via immutable snapshots with atomic commits, and metadata-only schema evolution using permanent column IDs. Unlike a traditional warehouse, Iceberg delivers warehouse-grade guarantees on open, decoupled object storage — but shifts operational responsibility (compaction, snapshot expiration) to the user. Major adoption signals include Snowflake achieving full parity with native tables, AWS S3 Tables with built-in Iceberg support, and companies like LinkedIn and Airbnb reporting significant performance and cost gains.

#big-data

#snowflake

#data-lake

#apache-iceberg

Mar 03•8m read time•From medium.com

Table of contents

So What Actually Broke? The Hive Era’s Dirty Secrets 1. Query Planning That Took Longer Than the Query Itself 2. Schema Changes That Silently Broke Your Data 3. No Safe Way to Update Data

Comment

Bookmark

Copy

Sort: