Apache Impala participated in the One Trillion Row Challenge, processing 2.4 TB of temperature data spread across 100,000 files with a simple SQL query. The challenge is to calculate the minimum, mean, and maximum temperature per weather station. The benchmark compared Impala against Coiled/Dask, ClickHouse, and Databricks, and Impala achieved competitive results with its massively parallel processing (MPP) architecture, scaling from a single executor to 64 executors on AWS r5d.4xlarge instances. The team also extended the challenge to test Iceberg integration with row-level modifications, processing roughly 70 billion deleted or modified records.
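To make the aggregation concrete, here is a minimal Python sketch of the per-station min/mean/max computation on a handful of made-up readings. It is not the SQL query the article ran on Impala; the function name `station_stats` and the sample data are hypothetical and exist only for illustration.

```python
def station_stats(readings):
    """Return {station: (min, mean, max)} for (station, temperature) pairs."""
    # Per-station running state: [min, max, sum, count].
    acc = {}
    for station, temp in readings:
        state = acc.get(station)
        if state is None:
            acc[station] = [temp, temp, temp, 1]
        else:
            state[0] = min(state[0], temp)
            state[1] = max(state[1], temp)
            state[2] += temp
            state[3] += 1
    # The mean is derived at the end from the accumulated sum and count.
    return {s: (lo, total / n, hi) for s, (lo, hi, total, n) in acc.items()}


# Made-up sample readings (hypothetical data, not from the benchmark).
sample = [
    ("Oslo", 2.0),
    ("Oslo", 6.0),
    ("Cairo", 30.0),
    ("Cairo", 34.0),
    ("Oslo", 4.0),
]
stats = station_stats(sample)
for station, (lo, mean, hi) in sorted(stats.items()):
    print(f"{station}: min={lo} mean={mean} max={hi}")
```

Keeping a running (min, max, sum, count) per station means partial results from different files can be combined without revisiting rows, which is the general property that lets this kind of aggregation be spread over many workers.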

5m read time. From itnext.io
Table of contents
The One Trillion Row challenge with Apache Impala
The challenge
The Result
Coiled / Dask
Clickhouse
Databricks
Apache Impala
One more thing…
Final thoughts
Resources
Authors
