A data engineering team at Mindbox replaced PySpark pipelines with a declarative stack built on dlt, dbt, Trino, and Airflow+Cosmos, enabling analysts and product managers to build data pipelines with nothing but YAML and SQL. The approach cut delivery time from weeks to a single day. The post walks through four configuration files: a dlt.yaml for data ingestion, sources.yaml and dbt_project.yaml alongside dbt SQL models for transformations, and a dag.yaml for Airflow orchestration. Limitations are also covered, including dlt's experimental Delta upsert support, Trino's fault-tolerance constraints at terabyte scale, and the difficulty of writing custom UDFs for logic that doesn't fit SQL.
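The excerpt doesn't reproduce the files themselves, but to give a flavor of the declarative layer, here is a minimal dbt sources.yaml sketch; the source, schema, and table names (raw, raw_data, orders, customers) are hypothetical illustrations, not taken from the post:

```yaml
version: 2

sources:
  - name: raw            # hypothetical source name, declared once and reused by models
    schema: raw_data     # schema where the ingestion layer (dlt in this stack) lands tables
    tables:
      - name: orders
      - name: customers
```

Downstream dbt SQL models can then reference these tables via {{ source('raw', 'orders') }} instead of hard-coded schema names, which is part of what lets analysts wire pipelines together without touching Python.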

11 min read · From towardsdatascience.com
Table of contents
Why PySpark Was Slowing Us Down
What We Replaced PySpark With: YAML and SQL Are All You Need
How We Load Data: dlt.yaml
How We Transform Data With SQL: dbt_project.yaml and sources.yaml
How We Configure Airflow: dag.yaml
How Analysts Build Pipelines Without Developers
What Changed After the Migration
Why the New Stack Doesn’t Fully Replace PySpark
What’s Next: Tests, Model Templates, and Training
