A hands-on guide to building a self-hosted, open-source data lake for batch ingestion using RustFS (S3-compatible object storage), Apache Iceberg (table format), Project Nessie (Git-like catalog), Apache Spark (PySpark jobs), and Apache Airflow (orchestration) — all running on Docker Compose. The tutorial walks through four progressively complex pipelines, from a hello-world DAG to a real-world scraper pipeline that fetches Binance trade data via a Redis-decoupled crawler. Key integration pitfalls are documented: Nessie namespace bootstrapping requirements, Quarkus S3 credential configuration quirks, Spark deploy mode limitations on standalone clusters, and NULL partition values caused by using the DataSource V1 write API instead of DataFrameWriterV2. The guide also covers the architectural rationale for decoupling the web crawler from Airflow via Redis, using a fixed signal table pattern for schema ownership, and paths forward for adding transform, analytics, and governance layers.
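To make the last pitfall concrete, here is a minimal PySpark sketch of the difference between the DataSource V1 write path and DataFrameWriterV2 when writing a partitioned Iceberg table. The catalog and table names ("nessie", "bronze.trades") and the "ts" column are illustrative assumptions, not taken from the tutorial's code.

```python
# Illustrative sketch of the DSv1 vs DataFrameWriterV2 pitfall; names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("iceberg-write-demo").getOrCreate()

df = spark.createDataFrame(
    [("BTCUSDT", 42000.0, "2024-01-01T00:00:00")],
    ["symbol", "price", "ts"],
).withColumn("ts", F.to_timestamp("ts"))

# DataSource V1 path -- the write carries no partition-transform declaration,
# which is how the NULL partition values mentioned above can appear:
# df.write.format("iceberg").mode("append").save("nessie.bronze.trades")

# DataFrameWriterV2 path -- the partition transform is declared on the writer itself:
(
    df.writeTo("nessie.bronze.trades")
      .partitionedBy(F.days("ts"))
      .createOrReplace()
)
```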
Table of contents
The Ingestion Problem
Stack
System Overview
Quick Start
Running the Pipelines
Setup
Path Forward