Materialize's Field CTO explains the engineering challenges of streaming live, transactionally consistent operational data into Apache Iceberg, a format designed for batch ETL. Key problems addressed:

- Avoiding memory-heavy buffering by minting batch descriptions ahead of time, so workers can stream data to S3 immediately.
- Handling CDC deletes efficiently with position deletes (cheap) rather than equality deletes (expensive), backed by an in-memory primary-key map.
- Storing recovery state directly in Iceberg snapshot properties to avoid an external coordination system.
- Managing the thousands of empty snapshots needed to track frontier progress.
- The unsolved problem of multi-table transactional consistency, which Iceberg's spec does not currently support.
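To make the delete-handling point concrete: an Iceberg position delete names the exact data file and row offset to skip, while an equality delete forces every reader to evaluate a predicate against the data. Below is a minimal sketch, not Materialize's actual implementation, of the kind of in-memory primary-key map the summary describes; the names `PKIndex`, `RowPosition`, `record_insert`, and `delete_to_position` are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowPosition:
    data_file: str   # path of the data file holding the row
    row_index: int   # 0-based position of the row within that file


class PKIndex:
    """Maps each primary key to the (file, position) where its row lives,
    so a CDC delete can be written as a cheap Iceberg position delete."""

    def __init__(self) -> None:
        self._index: dict[tuple, RowPosition] = {}

    def record_insert(self, pk: tuple, data_file: str, row_index: int) -> None:
        # Track where each row landed as data files are written.
        self._index[pk] = RowPosition(data_file, row_index)

    def delete_to_position(self, pk: tuple) -> RowPosition:
        # A position delete names (file, row) directly, so readers skip the
        # row without scanning; an equality delete would instead make every
        # reader test the predicate against all matching data files.
        return self._index.pop(pk)


idx = PKIndex()
idx.record_insert(("user-1",), "s3://bucket/data/00000.parquet", 0)
idx.record_insert(("user-2",), "s3://bucket/data/00000.parquet", 1)

# An upstream CDC delete for user-1 becomes a position-delete entry:
pos = idx.delete_to_position(("user-1",))
print(pos.data_file, pos.row_index)  # → s3://bucket/data/00000.parquet 0
```

The trade-off the summary alludes to: the map costs memory proportional to the live key set, but it keeps reads fast because consumers never pay the equality-delete scan penalty.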
Table of contents

- How Materialize Thinks About Consistency
- The Naive Approach
- Minting Batch Descriptions Ahead of Time
- The Delete Problem
- Recovery Without External State
- The Empty Snapshot Problem
- Multi-Table Transactions