This post compares lightweight open-source data engineering stacks with SaaS platforms such as Databricks and Microsoft Fabric. For small projects with a single data source and low data volumes, a self-hosted stack built on Kafka, Docker, Prefect, and PostgreSQL can be cheaper and more developer-friendly. The post walks through a concrete Docker Compose setup, explains why PostgreSQL's JSONB support suits bronze-layer ingestion, and shows how Prefect's pure-Python approach enables better code structure, testability, and flexibility (e.g., swapping Spark for Dask). The trade-off is framed as a choice between an all-inclusive holiday (SaaS) and booking everything yourself (lightweight), with guidance on when each approach makes sense.
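To make the JSONB and testability points concrete, here is a minimal sketch of bronze-layer ingestion logic written as a plain Python function. It is not taken from the post: the table name `bronze_events`, its columns, the `to_bronze_row` helper, and the psycopg-style `%s` placeholders are all assumptions chosen for illustration.

```python
import json
from datetime import datetime, timezone
from typing import Optional, Tuple

# Hypothetical bronze table: the raw message is kept as-is in a JSONB column,
# so no schema has to be agreed on before ingestion starts.
BRONZE_DDL = """
CREATE TABLE IF NOT EXISTS bronze_events (
    id          BIGSERIAL   PRIMARY KEY,
    source      TEXT        NOT NULL,
    ingested_at TIMESTAMPTZ NOT NULL,
    payload     JSONB       NOT NULL
);
"""

# Assumes a driver that uses %s placeholders (psycopg does).
BRONZE_INSERT = (
    "INSERT INTO bronze_events (source, ingested_at, payload) "
    "VALUES (%s, %s, %s::jsonb)"
)


def to_bronze_row(
    raw_message: bytes,
    source: str,
    now: Optional[datetime] = None,
) -> Tuple[str, datetime, str]:
    """Check that a raw message is valid JSON and shape it for BRONZE_INSERT."""
    ingested_at = now or datetime.now(timezone.utc)
    # json.loads raises ValueError on malformed input, so bad messages fail early.
    payload = json.loads(raw_message)
    # Re-serialize with sorted keys so the stored text is deterministic.
    return (source, ingested_at, json.dumps(payload, sort_keys=True))
```

Because the function has no dependency on an orchestrator or a database, it can be unit-tested directly. In a Prefect project it could presumably be wrapped with the `@task` decorator and called from a `@flow` function without changing its body.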