Materialize's Field CTO explains the engineering challenges of streaming live, transactionally consistent operational data into Apache Iceberg, a format designed for batch ETL. Key problems addressed include:

- avoiding memory-heavy buffering by minting batch descriptions ahead of time, so workers can stream data to S3 immediately;
- handling CDC deletes efficiently with position deletes (cheap) rather than equality deletes (expensive), backed by an in-memory primary-key map;
- storing recovery state directly in Iceberg snapshot properties to avoid external coordination systems;
- managing the thousands of empty snapshots needed to track frontier progress;
- and the unsolved problem of multi-table transactional consistency, which Iceberg's spec doesn't currently support.
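The position-delete trade-off described above can be sketched as follows. This is an illustrative toy, not Materialize's actual implementation: the class and method names are invented. The idea is to remember the (data file, row position) where each primary key was last written, so a CDC delete can name the exact row to suppress (a position delete) instead of forcing readers to evaluate a predicate like `pk = X` across data files (an equality delete).

```python
class PositionDeleteIndex:
    """Hypothetical in-memory primary-key map for a CDC-to-Iceberg sink.

    Tracks where each primary key was last written so deletes can be
    emitted as cheap position deletes instead of equality deletes.
    """

    def __init__(self):
        # pk -> (data_file_path, row_position)
        self._index = {}

    def record_insert(self, pk, file_path, row_pos):
        """Remember where this key's current row lives."""
        self._index[pk] = (file_path, row_pos)

    def delete(self, pk):
        """Translate a CDC delete for `pk` into an Iceberg delete record.

        If the key is in the map, emit a position delete naming the exact
        row; otherwise fall back to an equality delete, which is far more
        expensive for readers to apply.
        """
        loc = self._index.pop(pk, None)
        if loc is not None:
            return ("position_delete", loc[0], loc[1])
        return ("equality_delete", pk)


# Usage: an insert followed by a delete resolves to a position delete.
idx = PositionDeleteIndex()
idx.record_insert(42, "s3://bucket/data/f1.parquet", 7)
print(idx.delete(42))  # ('position_delete', 's3://bucket/data/f1.parquet', 7)
print(idx.delete(99))  # ('equality_delete', 99) -- key unknown, fall back
```

The memory cost is the map itself, which must cover every live primary key; that is the trade the article describes for avoiding equality-delete scans at read time.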

10m read time · From materialize.com
Table of contents

- How Materialize Thinks About Consistency
- The Naive Approach
- Minting Batch Descriptions Ahead of Time
- The Delete Problem
- Recovery Without External State
- The Empty Snapshot Problem
- Multi-Table Transactions