A data engineering team at Mindbox replaced PySpark pipelines with a declarative stack built on dlt, dbt, Trino, and Airflow+Cosmos, enabling analysts and product managers to build data pipelines with nothing but YAML and SQL. The approach cut delivery time from weeks to a single day. The post walks through four configuration files: a dlt.yaml for data ingestion, sources.yaml and dbt_project.yaml alongside dbt SQL models for transformations, and a dag.yaml for Airflow orchestration. Limitations are also covered, including dlt's experimental Delta upsert support, Trino's fault-tolerance constraints at terabyte scale, and the difficulty of writing custom UDFs for logic that doesn't fit SQL.
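The excerpt doesn't reproduce the files themselves, but to give a flavor of the declarative layer, here is a minimal dbt sources.yaml sketch; the source, schema, and table names (raw, raw_data, orders, customers) are hypothetical illustrations, not taken from the post:

```yaml
version: 2

sources:
  - name: raw            # hypothetical source name, declared once and reused by models
    schema: raw_data     # schema where the ingestion layer (dlt in this stack) lands tables
    tables:
      - name: orders
      - name: customers
```

Downstream dbt SQL models can then reference these tables via {{ source('raw', 'orders') }} instead of hard-coded schema names, which is part of what lets analysts wire pipelines together without touching Python.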

11 min read · From towardsdatascience.com
Table of contents
Why PySpark Was Slowing Us Down
What We Replaced PySpark With: YAML and SQL Are All You Need
How We Load Data: dlt.yaml
How We Transform Data With SQL: dbt_project.yaml and sources.yaml
How We Configure Airflow: dag.yaml
How Analysts Build Pipelines Without Developers
What Changed After the Migration
Why the New Stack Doesn’t Fully Replace PySpark
What’s Next: Tests, Model Templates, and Training
