A hands-on guide to building a self-hosted, open-source data lake for batch ingestion using RustFS (S3-compatible object storage), Apache Iceberg (table format), Project Nessie (Git-like catalog), Apache Spark (PySpark jobs), and Apache Airflow (orchestration) — all running on Docker Compose. The tutorial walks through four progressively complex pipelines, from a hello-world DAG to a real-world scraper pipeline that fetches Binance trade data via a Redis-decoupled crawler. Key integration pitfalls are documented: Nessie namespace bootstrapping requirements, Quarkus S3 credential configuration quirks, Spark deploy mode limitations on standalone clusters, and NULL partition values caused by using the DataSource V1 write API instead of DataFrameWriterV2. The guide also covers the architectural rationale for decoupling the web crawler from Airflow via Redis, using a fixed signal table pattern for schema ownership, and paths forward for adding transform, analytics, and governance layers.
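To make the last pitfall concrete, here is a minimal PySpark sketch of the difference between the DataSource V1 write path and DataFrameWriterV2 when writing a partitioned Iceberg table. The catalog and table names ("nessie", "bronze.trades") and the "ts" column are illustrative assumptions, not taken from the tutorial's code.

```python
# Illustrative sketch of the DSv1 vs DataFrameWriterV2 pitfall; names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("iceberg-write-demo").getOrCreate()

df = spark.createDataFrame(
    [("BTCUSDT", 42000.0, "2024-01-01T00:00:00")],
    ["symbol", "price", "ts"],
).withColumn("ts", F.to_timestamp("ts"))

# DataSource V1 path -- the write carries no partition-transform declaration,
# which is how the NULL partition values mentioned above can appear:
# df.write.format("iceberg").mode("append").save("nessie.bronze.trades")

# DataFrameWriterV2 path -- the partition transform is declared on the writer itself:
(
    df.writeTo("nessie.bronze.trades")
      .partitionedBy(F.days("ts"))
      .createOrReplace()
)
```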
Table of contents
The Ingestion Problem
Stack
System Overview
Quick Start
Running the Pipelines
Setup
Path Forward