Apache Iceberg is a table format specification that sits between file formats (Parquet, ORC) and compute engines, solving critical failures of the Hive era: slow query planning via directory listing, silent schema corruption from position-based column tracking, and lack of safe concurrent writes. Iceberg replaces directory-based tracking with a versioned metadata tree, enabling file-level statistics for query pruning, ACID transactions via immutable snapshots with atomic commits, and metadata-only schema evolution using permanent column IDs. Unlike a traditional warehouse, Iceberg delivers warehouse-grade guarantees on open, decoupled object storage — but shifts operational responsibility (compaction, snapshot expiration) to the user. Major adoption signals include Snowflake achieving full parity with native tables, AWS S3 Tables with built-in Iceberg support, and companies like LinkedIn and Airbnb reporting significant performance and cost gains.

8m read timeFrom medium.com
Post cover image
Table of contents
So What Actually Broke? The Hive Era’s Dirty Secrets1. Query Planning That Took Longer Than the Query Itself2. Schema Changes That Silently Broke Your Data3. No Safe Way to Update Data

Sort: