4 YAML Files Instead of PySpark: How We Let Analysts Build Data Pipelines Without Engineers
A data engineering team at Mindbox replaced PySpark pipelines with a declarative stack using dlt, dbt, Trino, and Airflow+Cosmos, enabling analysts and product managers to build data pipelines using only YAML and SQL. The approach reduces delivery time from weeks to one day. The post walks through four YAML/config files: a dlt.yaml for data ingestion, dbt SQL models with sources.yaml and dbt_project.yaml for transformations, and a dag.yaml for Airflow orchestration. Limitations are also covered, including dlt's experimental Delta upsert support, Trino's fault tolerance constraints at terabyte scale, and the difficulty of custom UDFs outside of SQL.
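To make the dbt side of the stack concrete, here is a minimal `sources.yaml` of the kind the post describes, declaring raw tables so SQL models can reference them with `source()`. This is a generic sketch in standard dbt syntax; the `raw` source name and the table names are placeholders, not taken from the article:

```yaml
version: 2

sources:
  - name: raw              # logical name used in models: {{ source('raw', 'orders') }}
    schema: raw_data       # physical schema where dlt lands the ingested tables
    tables:
      - name: orders
      - name: customers
```

A downstream dbt model can then select from these tables without hard-coding schema names, e.g. `select * from {{ source('raw', 'orders') }}`, which is what lets analysts wire up transformations in SQL alone.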
Table of contents
Why PySpark Was Slowing Us Down
What We Replaced PySpark With: YAML and SQL Are All You Need
How We Load Data: dlt.yaml
How We Transform Data With SQL: dbt_project.yaml and sources.yaml
How We Configure Airflow: dag.yaml
How Analysts Build Pipelines Without Developers
What Changed After the Migration
Why the New Stack Doesn’t Fully Replace PySpark
What’s Next: Tests, Model Templates, and Training