Best of Data Engineering, August 2025

  1. Article
    Towards Dev · 36w

    Building a Scalable Real-Time ETL Pipeline with Kafka, Debezium, Flink, Airflow, MinIO, and ClickHouse

    A comprehensive guide to building a scalable real-time ETL pipeline using open-source tools including Kafka for data streaming, Debezium for change data capture, Flink for stream processing, ClickHouse as a lakehouse solution, Airflow for orchestration, and MinIO for object storage. The architecture separates hot and cold data layers, with real-time data stored locally for performance and historical data in remote storage for cost optimization. Includes practical implementation steps, Docker configurations, and dashboard creation using Apache Superset.

  2. Article
    Data Engineering · 37w

    Data Engineer Project: From Streaming Orders to Batch Insights — A Coffee Shop Chain’s Data Pipeline

    A comprehensive data engineering project demonstrates building a complete pipeline for a coffee shop chain that processes real-time orders and provides instant product recommendations while supporting batch analytics. The implementation uses modern tools including Kafka for streaming, Spark for processing, Airflow for orchestration, Delta Lake for storage, Redis for caching, and MinIO for object storage. The project showcases Lakehouse architecture, data quality validation, and SCD Type 2 dimension modeling with full documentation and production-ready simulation.
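The SCD Type 2 dimension modeling mentioned above keeps full history by expiring old versions of a row instead of overwriting them. The in-memory sketch below shows the core idea under stated assumptions: the record shape, sentinel end date, and function names are illustrative, and the article's actual implementation would use a Delta Lake MERGE rather than plain Python.

```python
from dataclasses import dataclass, field, replace

OPEN_END = "9999-12-31"  # conventional sentinel for the current version

@dataclass
class DimRow:
    key: str                 # business key, e.g. a product id
    attrs: dict              # tracked attributes
    valid_from: str
    valid_to: str = OPEN_END
    is_current: bool = True

def apply_scd2(dim: list[DimRow], key: str, attrs: dict, as_of: str) -> list[DimRow]:
    """Close the current version if attributes changed, then append a new one."""
    out, changed = [], True
    for row in dim:
        if row.key == key and row.is_current:
            if row.attrs == attrs:
                changed = False              # no change: keep the row open
                out.append(row)
            else:                            # expire the old version
                out.append(replace(row, valid_to=as_of, is_current=False))
        else:
            out.append(row)
    if changed:
        out.append(DimRow(key=key, attrs=attrs, valid_from=as_of))
    return out
```

A price change, for example, yields two rows for the same key: the old version closed at the change date, and a new open version carrying the new price.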

  3. Article
    Daily Dose of Data Science | Avi Chawla | Substack · 36w

    The Full MLOps/LLMOps Blueprint

    MLOps extends beyond model training to encompass the entire production ML system lifecycle, including data pipelines, deployment, monitoring, and infrastructure management. The crash course covers foundational concepts like why MLOps matters, differences from traditional DevOps, and system-level concerns, followed by hands-on implementation of the complete ML workflow from training to API deployment. MLOps applies software engineering and DevOps practices to manage the complex infrastructure surrounding ML code, ensuring reliable delivery of ML-driven features at scale.

  4. Article
    Daily Dose of Data Science | Avi Chawla | Substack · 33w

    Data and Pipeline Engineering for ML Systems (With Implementation)

    A comprehensive MLOps crash course covering data and pipeline engineering for ML systems. The series explores data sources, ETL pipelines, model training, deployment, versioning, and reproducibility. It includes hands-on implementations using tools like PyTorch, MLflow, Git, DVC, and Weights & Biases, providing both foundational concepts and practical system-level thinking for production ML environments.

  5. Article
    Towards Dev · 36w

    Handle Schema Evolution like your job depends on it

Schema evolution in data engineering means handling structural changes to incoming data without breaking existing pipelines. The approach described maintains a schema evolution master table for tracking changes, handles all schema evolution at the bronze layer, enforces target schemas for crucial columns, and locks schemas in the silver and gold layers. A practical implementation includes an align_schema function that adds missing columns as nulls, drops extra columns, and logs every schema change to a Delta table for monitoring and auditing.
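The align_schema behavior summarized above can be sketched with plain Python dicts standing in for Spark DataFrame rows; the column names and log-entry format here are assumptions for illustration, and the article's version would operate on DataFrames and write its log to a Delta table.

```python
def align_schema(record: dict, target_schema: list[str], log: list[dict]) -> dict:
    """Conform a record to the target schema, logging every change."""
    aligned = {}
    for col in target_schema:
        if col not in record:
            log.append({"action": "added_null", "column": col})
        aligned[col] = record.get(col)       # missing columns become None
    for col in record:
        if col not in target_schema:
            log.append({"action": "dropped", "column": col})
    return aligned
```

Keeping the log as structured entries rather than free text makes it easy to load into a monitoring table and query for unexpected drift.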