Best of Data Engineering 2025

  1. Article · SwirlAI · 1y

    The evolution of Modern RAG Architectures.

    The post delves into the evolution of Retrieval Augmented Generation (RAG) architectures, discussing their development from Naive RAG to advanced techniques, including Cache Augmented Generation (CAG) and Agentic RAG. It highlights the challenges addressed at each stage, advanced methods to improve accuracy, and the potential future advancements in RAG systems.
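The retrieval step that even Naive RAG puts in front of generation can be sketched in a few lines. This is a toy illustration, not the article's code: a bag-of-words cosine similarity stands in for a real embedding model, and the prompt is assembled without an actual LLM call.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; real RAG systems use dense vector models.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Rank the corpus by similarity to the query and keep the top-k chunks.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "Kafka is a distributed log for streaming data.",
    "DuckDB is an in-process analytical database.",
    "Airflow schedules batch workflows as DAGs.",
]
context = retrieve("streaming data with Kafka", docs)
# The retrieved chunks are stuffed into the prompt that goes to the LLM.
prompt = "Answer using only this context:\n" + "\n".join(context) + "\nQ: What is Kafka?"
```

Cache Augmented Generation and Agentic RAG replace or wrap this retrieve-then-prompt loop, but the core contract (fetch grounding context, then generate) stays the same.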

  2. Article · Materialized View · 46w

    Kafka: The End of the Beginning

    Apache Kafka has dominated streaming data for over a decade, but innovation has stagnated while batch processing has evolved rapidly. The streaming ecosystem faces challenges with slow growth, long sales cycles, and lack of new ideas. While Kafka's protocol has become the de facto standard, its architecture shows limitations for modern cloud-native requirements. New solutions like S2 are emerging with fresh approaches, and the next decade could see a transition similar to how batch processing moved beyond Hadoop, potentially ushering in a truly cloud-native streaming era.

  3. Article · Medium · 1y

    Building a TikTok-like recommender

    A comprehensive guide on building a TikTok-like real-time personalized recommender system, detailing the architecture, including the 4-stage recommender model and the two-tower neural network design. It uses an H&M retail dataset for practical application and teaches feature engineering, model training, and serving with the Hopsworks AI Lakehouse. The post is part of an open-source course focused on deploying scalable recommenders.
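The two-tower idea is that users and items are embedded by separate networks into one vector space, so candidate retrieval reduces to a dot product. A minimal sketch with NumPy, where random projection matrices stand in for the trained towers (all names here are illustrative, not from the course):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "towers": in a real system these are trained neural networks.
W_user = rng.normal(size=(8, 16))   # maps 8 user features -> 16-d embedding
W_item = rng.normal(size=(12, 16))  # maps 12 item features -> 16-d embedding

def user_tower(x):
    e = x @ W_user
    return e / np.linalg.norm(e)

def item_tower(x):
    e = x @ W_item
    return e / np.linalg.norm(e)

user = user_tower(rng.normal(size=8))
items = np.stack([item_tower(rng.normal(size=12)) for _ in range(100)])

# Retrieval stage: score every candidate with one dot product, keep top-5.
scores = items @ user
top5 = np.argsort(scores)[::-1][:5]
```

In production the item embeddings are precomputed and indexed (e.g. in an ANN index), so only the user tower runs at request time; later ranking stages then re-score the shortlist with richer features.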

  4. Article · SwirlAI · 1y

    Building Deep Research Agent from scratch

    The post guides readers through building a Deep Research Agent using the DeepSeek R1 model. It explains the concept of Deep Research Agents, outlines their components and steps involved, and provides a thorough implementation guide using SambaNova's platform. The setup includes planning the research, splitting tasks, performing in-depth web searches, reflecting on gathered data, and summarizing results into a final research report. The necessary code and prompts are shared for an interactive learning experience.
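The plan → search → reflect → summarize loop described above can be sketched as plain control flow. The `llm` and `web_search` functions below are hypothetical stubs; in the article they call the DeepSeek R1 model (via SambaNova) and a real web-search API.

```python
# Hypothetical stubs standing in for real model and search calls.
def llm(prompt: str) -> str:
    return f"[model output for: {prompt[:40]}...]"

def web_search(query: str) -> str:
    return f"[search results for: {query}]"

def deep_research(topic: str, max_rounds: int = 2) -> str:
    # 1. Plan: break the topic into sub-questions.
    plan = llm(f"Break '{topic}' into sub-questions")
    notes = []
    for round_no in range(max_rounds):
        # 2. Search: gather material for this round of the plan.
        findings = web_search(f"{topic} (round {round_no}, plan: {plan})")
        notes.append(findings)
        # 3. Reflect: ask the model whether the gathered material suffices.
        if "enough" in llm(f"Reflect on: {findings}"):
            break
    # 4. Summarize everything into the final report.
    return llm("Summarize into a report:\n" + "\n".join(notes))

report = deep_research("vector databases")
```

The value of the pattern is that each stage is a separate prompt, so planning, searching, and reflection can be tuned or swapped independently.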

  5. Article · Groww Engineering · 22w

    When Two Databases Become One: How DuckDB Saved Our Trading Operations from Manual Reconciliation

    A trading platform faced recurring position-order mismatches across two separate MySQL databases, requiring 20-30 minutes of manual reconciliation by two engineers. By leveraging DuckDB's MySQL scanner extension to perform cross-database joins, they automated the entire process into a 2-3 minute operation running every 15 minutes. The solution eliminated manual intervention, improved accuracy from 85% to 99.9%, and enabled proactive monitoring instead of reactive fixes during market hours.
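The core trick is a single SQL query that joins tables living in two different databases. The team used DuckDB's MySQL scanner (attaching each live MySQL database into one DuckDB session); the sketch below shows the same reconciliation join using sqlite3's `ATTACH` purely as a runnable stand-in, with made-up table names.

```python
import sqlite3

con = sqlite3.connect(":memory:")           # first database: positions
con.execute("ATTACH ':memory:' AS orders")  # second, separate database

con.execute("CREATE TABLE positions (symbol TEXT, qty INTEGER)")
con.execute("CREATE TABLE orders.fills (symbol TEXT, qty INTEGER)")
con.executemany("INSERT INTO positions VALUES (?, ?)",
                [("AAPL", 100), ("MSFT", 50)])
con.executemany("INSERT INTO orders.fills VALUES (?, ?)",
                [("AAPL", 100), ("MSFT", 40)])

# One cross-database join surfaces every mismatch; no manual diffing.
mismatches = con.execute("""
    SELECT p.symbol, p.qty AS position_qty, f.qty AS filled_qty
    FROM positions p JOIN orders.fills f ON p.symbol = f.symbol
    WHERE p.qty != f.qty
""").fetchall()
```

Scheduled every 15 minutes, a query like this turns a 20-30 minute manual reconciliation into an automated check whose output is just the rows that need attention.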

  6. Article · Data Engineer Things · 1y

    End to End Data Engineering

    This post details the tools, technologies, and concepts essential for data engineering, emphasizing different paths for success based on roles and backgrounds. It highlights the importance of both analytics and infrastructure sides and mentions popular tools like Airflow and Snowflake. The significance of software engineering principles and specific data engineering roles is also discussed.

  7. Article · SwirlAI · 1y

    Data Pipelines in Machine Learning Systems.

    This tutorial walks through implementing a real-time data ingestion pipeline for machine learning systems using FastAPI and Apache Spark. Key steps include writing a FastAPI collector application, downloading data from the internet and pushing it to that application, and processing the data via a Spark ETL pipeline managed by Airflow, all deployed on the Nebius AI Cloud platform. The tutorial emphasizes data quality and integrity at each stage and shows how to set up Kubernetes clusters for high availability and managed data operations.
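The collector's job is to accept events over HTTP, validate them, and buffer them for downstream processing. The article builds it with FastAPI; as a self-contained stand-in, here is the same shape using only the standard library, with a toy in-memory buffer and a hypothetical `user_id` quality gate:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

RECEIVED = []  # stand-in buffer; the real collector forwards events to the ETL

class Collector(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        event = json.loads(body)
        # Basic data-quality gate before accepting the event.
        if "user_id" not in event:
            self.send_response(400)
            self.end_headers()
            return
        RECEIVED.append(event)
        self.send_response(202)  # accepted for asynchronous processing
        self.end_headers()

    def log_message(self, *args):
        pass  # keep output quiet

server = HTTPServer(("127.0.0.1", 0), Collector)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Simulate a producer pushing one event to the collector.
url = f"http://127.0.0.1:{server.server_port}/collect"
req = urllib.request.Request(
    url,
    data=json.dumps({"user_id": 1, "action": "click"}).encode(),
    headers={"Content-Type": "application/json"},
)
status = urllib.request.urlopen(req).status
server.shutdown()
```

Rejecting malformed events at the door (the 400 branch) is the first of the quality checks the tutorial layers through the pipeline.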

  8. Article · Data Engineer Things · 1y

    Workflow Orchestration Tools

    Workflow orchestration tools like Airflow, Prefect, Windmill, Kestra, Temporal, and Dagster are essential for managing complex processes across automated tasks and systems. Key features include automated task scheduling, error handling, integration with multiple tools, real-time monitoring, and scalability. Each tool has unique strengths: Airflow with its robust community and dynamic workflows, Prefect's cloud-native integration and flexibility, Temporal's advanced workflow management, Kestra's event-driven architecture, Windmill's efficient runtime and low-code builders, and Dagster's asset-centric approach and modular architecture.
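Under all of these tools sits the same primitive: a dependency graph of tasks executed in topological order, with a task running only once its upstreams have succeeded. A minimal sketch with the standard library's `graphlib` (task names are illustrative):

```python
from graphlib import TopologicalSorter

# A tiny dependency graph: extract feeds both a quality check and a
# transform; load waits on both.
dag = {
    "extract": set(),
    "quality_check": {"extract"},
    "transform": {"extract"},
    "load": {"transform", "quality_check"},
}

# static_order() yields a valid execution order respecting dependencies.
order = list(TopologicalSorter(dag).static_order())
```

What the products above add on top of this core is the operationally hard part: retries, scheduling, monitoring, backfills, and distributed execution.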

  9. Article · Data Engineer Things · 41w

    Building a Real-Time Flight Data Pipeline with Kafka, Spark, and Airflow

    A comprehensive guide to building a real-time flight data pipeline using Kafka for streaming, Spark for processing, and Airflow for orchestration. The pipeline fetches live flight data from a custom API, streams it through Kafka to MongoDB for storage, then uses Airflow to schedule daily ETL jobs that extract landed flight information into PostgreSQL and generate CSV reports. The project includes Docker containerization, complete code examples, and demonstrates end-to-end data engineering practices from real-time ingestion to batch processing and reporting.

  10. Article · ByteByteGo · 52w

    How Netflix Orchestrates Millions of Workflow Jobs with Maestro

    Netflix transitioned from using the Meson orchestrator to Maestro due to scalability issues with the growing volume of data and workflows. Maestro, built with a distributed microservices architecture, efficiently manages large-scale workflows with high reliability and low operational overhead. It supports dynamic workflows, defined via DSLs, a visual UI, or programmatic APIs, and leverages technologies such as CockroachDB and distributed queues. Features like event publishing, parameterized workflows, and an integrated signal service enable Maestro to handle extensive data processing and machine learning tasks at scale.

  11. Article · Data Engineer Things · 1y

    It's time to try Kestra

    Kestra is presented as an underrated yet powerful workflow orchestrator, boasting a user-friendly UI, YAML-based workflows, comprehensive documentation, and impressive scalability and performance. While it faces challenges such as being relatively new, having a smaller community, and some limitations in advanced features, Kestra’s simplicity and efficiency make it a promising tool for the future of data team workflow orchestration.

  12. Article · Towards Dev · 36w

    Building a Scalable Real-Time ETL Pipeline with Kafka, Debezium, Flink, Airflow, MinIO, and ClickHouse

    A comprehensive guide to building a scalable real-time ETL pipeline using open-source tools including Kafka for data streaming, Debezium for change data capture, Flink for stream processing, ClickHouse as a lakehouse solution, Airflow for orchestration, and MinIO for object storage. The architecture separates hot and cold data layers, with real-time data stored locally for performance and historical data in remote storage for cost optimization. Includes practical implementation steps, Docker configurations, and dashboard creation using Apache Superset.

  13. Article · Towards AI · 1y

    End-to-End Data Engineering System on Real Data with Kafka, Spark, Airflow, Postgres, and Docker

    The post provides a detailed guide on building an end-to-end data engineering system using Kafka for data streaming, Spark for data transformation, Airflow for orchestration, PostgreSQL for storage, and Docker for setup and deployment. It is structured into two phases: the first focuses on constructing the data pipeline, while the second will cover creating an application to interact with the database using language models. This project is particularly suited for beginners to data engineering, aiming to deepen their practical knowledge of handling data systems.

  14. Article · Data Engineering · 37w

    Data Engineer Project: From Streaming Orders to Batch Insights — A Coffee Shop Chain’s Data Pipeline

    A comprehensive data engineering project demonstrates building a complete pipeline for a coffee shop chain that processes real-time orders and provides instant product recommendations while supporting batch analytics. The implementation uses modern tools including Kafka for streaming, Spark for processing, Airflow for orchestration, Delta Lake for storage, Redis for caching, and MinIO for object storage. The project showcases Lakehouse architecture, data quality validation, and SCD Type 2 dimension modeling with full documentation and production-ready simulation.
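SCD Type 2 modeling keeps full history by closing out the current dimension row and appending a new one whenever an attribute changes. A minimal in-memory sketch of that rule (the project does this in Delta Lake; column names here are illustrative):

```python
from datetime import date

def scd2_apply(dim_rows: list[dict], key: str, new_row: dict, today: date) -> None:
    """Apply one change to a Type 2 slowly changing dimension in place.

    The current version of a row carries valid_to=None; a change closes
    that row and appends a new current row, preserving full history.
    """
    for row in dim_rows:
        if row[key] == new_row[key] and row["valid_to"] is None:
            if all(row[c] == new_row[c] for c in new_row):
                return  # no attribute changed; nothing to do
            row["valid_to"] = today  # close out the old version
            break
    dim_rows.append({**new_row, "valid_from": today, "valid_to": None})

dim = [{"product_id": 1, "price": 3.5,
        "valid_from": date(2024, 1, 1), "valid_to": None}]
# A price change arrives: the old row is closed, a new current row appended.
scd2_apply(dim, "product_id", {"product_id": 1, "price": 4.0}, date(2024, 6, 1))
```

At warehouse scale the same logic is typically expressed as a `MERGE` statement, but the invariant is identical: exactly one open row per key, with contiguous validity intervals behind it.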

  15. Article · Tinybird · 38w

    Why LLMs struggle with analytics

    LLMs face significant challenges when working with analytical data, struggling with tabular data interpretation, SQL generation accuracy, and complex database schemas. The key to successful agentic analytics lies in providing comprehensive context through detailed documentation, semantic models, and sample data rather than expecting perfect SQL generation. Building query validation loops with error feedback, using LLM-as-a-judge evaluators, and focusing on business understanding over technical perfection enables more reliable analytical insights.

  16. Article · MotherDuck · 16w

    Stop Paying the Complexity Tax

    Most organizations don't need massive distributed data systems. The industry has over-engineered solutions for edge cases, forcing everyone to pay a complexity tax for scale they'll never require. Modern single-machine databases can handle what previously required distributed systems, with machines now offering 192 cores and 1.5TB of memory. By separating storage (cheap, infinite object storage) from compute (ephemeral, cloneable instances), and designing for the common case of small data with occasional big compute needs, teams can achieve better performance with dramatically simpler architecture. DuckDB exemplifies this approach by focusing on the complete user experience, not just query performance, while MotherDuck extends it with cloud durability and per-user isolation through individual database instances that spin up in under 100ms.

  17. Article · ByteByteGo · 19w

    How Netflix Built a Distributed Write Ahead Log For Its Data Platform

    Netflix built a distributed Write-Ahead Log (WAL) system to solve data reliability issues across their platform. The WAL captures every data change before applying it to databases, enabling automatic retries, cross-region replication, and multi-partition consistency. Built on top of their Data Gateway Infrastructure, it uses Kafka and Amazon SQS as pluggable backends, supports multiple use cases through namespaces, and scales independently through sharded deployments. The system provides durability guarantees while allowing teams to configure retry logic, delays, and targets without code changes.
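The write-ahead idea itself is simple: make every change durable in a log before the database sees it, so a crash between the two steps can be repaired by replaying the log. A toy single-node sketch (Netflix's system is distributed, backed by Kafka/SQS, and far richer; this only shows the core invariant):

```python
import json
import os
import tempfile

class WriteAheadLog:
    """Toy sketch: persist every change before applying it, so unapplied
    changes can be replayed after a crash."""

    def __init__(self, path: str):
        self.path = path

    def append(self, change: dict) -> None:
        with open(self.path, "a") as f:
            f.write(json.dumps(change) + "\n")
            f.flush()
            os.fsync(f.fileno())  # durable before the database sees it

    def replay(self, apply) -> int:
        count = 0
        with open(self.path) as f:
            for line in f:
                apply(json.loads(line))
                count += 1
        return count

db = {}
wal = WriteAheadLog(tempfile.NamedTemporaryFile(delete=False, suffix=".wal").name)
for change in [{"key": "a", "value": 1}, {"key": "b", "value": 2}]:
    wal.append(change)  # log first ...
# ... pretend we crashed before applying; recovery replays the log:
applied = wal.replay(lambda c: db.__setitem__(c["key"], c["value"]))
```

Everything Netflix layers on top (retries, cross-region replication, pluggable queue backends, namespaces) generalizes this log-first contract to a multi-tenant, multi-region platform.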

  18. Article · Tinybird · 1y

    Writing tests sucks. Use LLMs so it sucks less.

    The post discusses the challenges and solutions for testing in data engineering. It highlights several key obstacles, such as data variability, complex transformations, and lack of tooling. Tinybird aims to address these issues with tools like 'tb mock' for generating realistic test data, and 'tb test' for validating data transformations. The use of LLMs to handle mundane aspects of test generation is emphasized, making testing less tedious and more efficient.

  19. Article · Supabase · 18w

    Introducing iceberg-js: A JavaScript Client for Apache Iceberg

    Supabase released iceberg-js, an open-source JavaScript/TypeScript client for Apache Iceberg REST Catalog API. The library provides type-safe catalog management for namespaces and tables, works across all JavaScript environments, and is intentionally minimal—it handles only catalog operations, not data reads/writes or query execution. Built to power Supabase's Analytics Buckets feature, it's vendor-agnostic, uses native fetch API, and supports multiple authentication methods. The MIT-licensed library is available on GitHub and npm.

  20. Article · InfoQ · 17w

    Decathlon Switches to Polars to Optimize Data Pipelines and Infrastructure Costs

    Decathlon migrated data pipelines processing small to mid-size datasets (under 50 GiB) from Apache Spark clusters to Polars running on single Kubernetes pods. The switch reduced compute launch time from 8 to 2 minutes and significantly lowered infrastructure costs. Polars' streaming engine enables processing datasets larger than available memory on modest hardware. The team now uses Polars for new pipelines with stable, smaller input tables that don't require complex joins or aggregations, while keeping Spark for terabyte-scale workloads. Challenges include managing Kubernetes infrastructure and limitations with certain Delta Lake features.

  21. Article · Data Engineer Things · 1y

    Apache Airflow Overview

    Apache Airflow, created at Airbnb in 2014 and now an open-source project under Apache, is a popular orchestration tool for managing complex data workflows. It operates using Directed Acyclic Graphs (DAGs) to define tasks and their dependencies. Core components include the Scheduler, Web Server, Metadata Database, and Workers. Airflow supports task concurrency, resource management, and integrations with external systems via operators and hooks. It offers various executors for task management, including SequentialExecutor, LocalExecutor, CeleryExecutor, and KubernetesExecutor. Deployment options range from single-machine setups to distributed and Kubernetes-based environments.

  22. Article · ByteByteGo · 51w

    EP159: The Data Engineering Roadmap

    Data engineering is crucial for effective data analysis. Key components include learning SQL and programming languages, mastering various processing tools, databases, messaging platforms, data lakes, cloud computing platforms, storage systems, orchestration tools, automation, and frontend/dashboarding tools.

  23. Article · ByteByteGo · 22w

    How Spotify Built Its Data Platform To Understand 1.4 Trillion Data Points

    Spotify processes 1.4 trillion data points daily through a sophisticated data platform that evolved from a single Hadoop cluster to a multi-product system running on Google Cloud. The platform consists of three core components: data collection (capturing events from millions of devices using client SDKs and Kubernetes operators), data processing (running 38,000+ automated pipelines using BigQuery, Flink, and Apache Beam), and data management (ensuring privacy, security, and compliance). The architecture emphasizes self-service capabilities, allowing product teams to define event schemas and deploy infrastructure automatically while maintaining centralized governance. Built-in anonymization, lineage tracking, and quality checks ensure data trustworthiness across financial reporting, personalized recommendations, and experimentation systems.

  24. Article · System Design Newsletter · 29w

    How Kafka Works

    Apache Kafka is a distributed, fault-tolerant pub/sub messaging system built on a simple log data structure. It uses brokers for horizontal scaling, partitions for data sharding, and replication for durability. The system employs KRaft consensus for leader election and metadata management. Key features include tiered storage for cost optimization, consumer groups for parallel processing, transactions for exactly-once semantics, and ecosystem components like Kafka Streams for stream processing and Kafka Connect for system integration.
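Partitioning is what makes both the sharding and the ordering guarantees work: a record's key is hashed to pick a partition, so the same key always lands on the same partition (preserving per-key order) while distinct keys spread across brokers. A sketch of the idea; Kafka's Java client actually uses murmur2, and md5 here is just a deterministic stand-in:

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Deterministic hash of the record key selects the partition.
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Same key -> same partition, so per-key ordering is preserved, while
# different keys spread across partitions for horizontal scaling.
p1 = partition_for(b"user-42", 6)
p2 = partition_for(b"user-42", 6)
p3 = partition_for(b"user-7", 6)
```

Consumer groups then assign partitions to consumers, which is why the number of partitions caps the parallelism of a single group.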

  25. Article · Daily Dose of Data Science (Avi Chawla) · 36w

    The Full MLOps/LLMOps Blueprint

    MLOps extends beyond model training to encompass the entire production ML system lifecycle, including data pipelines, deployment, monitoring, and infrastructure management. The crash course covers foundational concepts like why MLOps matters, differences from traditional DevOps, and system-level concerns, followed by hands-on implementation of the complete ML workflow from training to API deployment. MLOps applies software engineering and DevOps practices to manage the complex infrastructure surrounding ML code, ensuring reliable delivery of ML-driven features at scale.