Best of Data Engineering, March 2025

  1. Article
     Medium · 1y

    Building a TikTok-like recommender

    A comprehensive guide to building a TikTok-like real-time personalized recommender system, detailing the architecture, including the 4-stage recommender model and the two-tower neural network design. It applies these ideas to an H&M retail dataset, covering feature engineering, model training, and serving with the Hopsworks AI Lakehouse. The post is part of an open-source course on deploying scalable recommenders.
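    The core of the two-tower design can be sketched in a few lines: one tower embeds users, another embeds items into the same vector space, and candidates are ranked by dot product. This is a minimal pure-Python illustration with hand-picked toy weights, not the article's trained Hopsworks pipeline.

    ```python
    import math

    def linear(x, weights):
        """One dense layer: multiply an input vector by a weight matrix."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

    def l2_normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    def user_tower(user_features, w_user):
        # Maps user features (e.g. age, purchase stats) to an embedding.
        return l2_normalize(linear(user_features, w_user))

    def item_tower(item_features, w_item):
        # Maps item features (e.g. category, price) into the SAME space.
        return l2_normalize(linear(item_features, w_item))

    def score(u_emb, i_emb):
        # Dot product of normalized embeddings = cosine similarity.
        return sum(a * b for a, b in zip(u_emb, i_emb))

    # Toy identity weights (2 features -> 2-dim embedding);
    # in the article both towers are trained neural networks.
    w_user = [[1.0, 0.0], [0.0, 1.0]]
    w_item = [[1.0, 0.0], [0.0, 1.0]]

    u = user_tower([0.9, 0.1], w_user)
    candidates = {"item_a": [0.8, 0.2], "item_b": [0.1, 0.9]}
    ranked = sorted(candidates,
                    key=lambda k: score(u, item_tower(candidates[k], w_item)),
                    reverse=True)
    print(ranked)  # item_a aligns with this user's vector
    ```

    Because both towers emit vectors in one space, item embeddings can be precomputed and retrieved with approximate nearest-neighbor search, which is what makes the design fast enough for real-time serving.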

  2. Article
     SwirlAI · 1y

    Building Deep Research Agent from scratch

    The post guides readers through building a Deep Research Agent using the DeepSeek R1 model. It explains the concept of Deep Research Agents, outlines their components and the steps involved, and provides a thorough implementation guide using SambaNova's platform. The setup includes planning the research, splitting tasks, performing in-depth web searches, reflecting on the gathered data, and summarizing the results into a final research report. The necessary code and prompts are shared for an interactive learning experience.
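    The plan / search / reflect / summarize loop described above can be sketched as a simple control flow. The stub functions below are placeholders: in the article, planning and reflection are LLM calls (DeepSeek R1 via SambaNova) and the search step hits real web APIs.

    ```python
    def plan_research(question):
        # Stub for the LLM planning step: split the question into sub-tasks.
        return [f"background on {question}", f"recent work on {question}"]

    def web_search(sub_question):
        # Stub for an in-depth web search returning notes.
        return [f"note about {sub_question}"]

    def reflect(notes):
        # Stub for LLM reflection: are the notes sufficient? (here: 2+ notes)
        return len(notes) >= 2

    def summarize(question, notes):
        # Stub for the final LLM summarization into a report.
        return f"Report on {question}: {len(notes)} findings."

    def deep_research(question, max_rounds=3):
        notes = []
        for _ in range(max_rounds):
            for sub_q in plan_research(question):   # 1. plan / split tasks
                notes.extend(web_search(sub_q))     # 2. search
            if reflect(notes):                      # 3. reflect, maybe loop
                break
        return summarize(question, notes)           # 4. summarize

    print(deep_research("data lakehouses"))
    ```

    The key design point is the reflection loop: rather than a single pass, the agent keeps planning and searching until its own critique judges the notes sufficient.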

  3. Article
     Data Engineer Things · 1y

    Workflow Orchestration Tools

    Workflow orchestration tools like Airflow, Prefect, Windmill, Kestra, Temporal, and Dagster are essential for managing complex processes across automated tasks and systems. Key features include automated task scheduling, error handling, integration with multiple tools, real-time monitoring, and scalability. Each tool has unique strengths: Airflow's robust community and dynamic workflows, Prefect's cloud-native integration and flexibility, Temporal's advanced workflow management, Kestra's event-driven architecture, Windmill's efficient runtime and low-code builders, and Dagster's asset-centric approach and modular architecture.
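    At their core, all of these tools run tasks in dependency order with error handling. A toy orchestrator using only the standard library's `graphlib` shows the idea; the task names and the error handling here are illustrative, and real orchestrators add retries, alerting, backfills, and distributed execution.

    ```python
    from graphlib import TopologicalSorter  # stdlib, Python 3.9+

    # A toy DAG: each task maps to the set of tasks it depends on,
    # mirroring what Airflow/Prefect/Dagster express with richer APIs.
    dag = {
        "extract": set(),
        "transform": {"extract"},
        "validate": {"transform"},
        "load": {"validate"},
        "notify": {"load"},
    }

    def run(dag, tasks):
        """Execute tasks in dependency order, stopping on the first failure."""
        results = {}
        for name in TopologicalSorter(dag).static_order():
            try:
                results[name] = tasks[name]()
            except Exception as exc:
                # Real orchestrators retry, alert, and mark downstream tasks skipped.
                results[name] = f"failed: {exc}"
                break
        return results

    tasks = {name: (lambda n=name: f"{n} ok") for name in dag}
    print(run(dag, tasks))
    ```

    The differences between the tools listed above are largely in what surrounds this loop: how the DAG is declared, how state is persisted, and how runs are triggered and observed.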

  4. Article
     Tinybird · 1y

    Writing tests sucks. Use LLMs so it sucks less.

    The post discusses the challenges and solutions for testing in data engineering. It highlights several key obstacles, such as data variability, complex transformations, and lack of tooling. Tinybird aims to address these issues with tools like 'tb mock' for generating realistic test data, and 'tb test' for validating data transformations. The use of LLMs to handle mundane aspects of test generation is emphasized, making testing less tedious and more efficient.
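    Conceptually, a tool like 'tb mock' generates realistic rows from a schema so tests don't depend on production data. The sketch below is a hand-rolled stand-in using only the standard library, not Tinybird's implementation; the schema and column names are invented for illustration.

    ```python
    import random
    import datetime

    # Hypothetical event schema: each column maps to a generator function.
    # 'tb mock' infers this from a Tinybird data source; here we write it by hand.
    schema = {
        "user_id": lambda rng: rng.randrange(1, 1000),
        "event": lambda rng: rng.choice(["view", "click", "purchase"]),
        "ts": lambda rng: (datetime.datetime(2025, 3, 1)
                           + datetime.timedelta(seconds=rng.randrange(86400))
                           ).isoformat(),
    }

    def mock_rows(schema, n, seed=42):
        # Seeded RNG so the fixture is reproducible across test runs.
        rng = random.Random(seed)
        return [{col: gen(rng) for col, gen in schema.items()} for _ in range(n)]

    rows = mock_rows(schema, 3)
    print(rows[0])
    ```

    Feeding generated rows like these through a transformation and asserting on the output is the pattern 'tb test' automates, with the LLM handling the tedious part: writing plausible generators and expected results.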

  5. Article
     dltHub · 1y

    Why Iceberg + Python is the Future of Open Data Lakes

    Apache Iceberg, combined with Python, is revolutionizing data lakes by delivering ACID transactions, schema evolution, and an open, vendor-agnostic table format. Netflix, Apple, and Adobe are early adopters, and the technology is supported by Trino, Snowflake, and BigQuery. Iceberg's open ecosystem and composability allow seamless integration without overhauling existing systems. This approach is crucial for AI and machine learning, providing efficient, structured data for scalable and cost-effective workloads.

  6. Video
     YouTube · 1y

    SQL Data Warehouse from Scratch | Full Hands-On Data Engineering Project

    Learn how to build a modern SQL data warehouse from scratch, incorporating real-world practices used in companies like Mercedes-Benz. The project covers data architecture design, ETL processes, and data modeling basics. By the end, you'll have a professional portfolio project to showcase your skills.
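    The layered modeling the video builds (staging tables loaded raw, then reshaped into dimension and fact tables) can be sketched with SQLite standing in for a real warehouse engine. The table and column names below are illustrative, not the project's actual schema.

    ```python
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- Staging layer: raw rows landed as-is by the ETL.
        CREATE TABLE stg_sales (order_id INT, customer TEXT, amount REAL);
        INSERT INTO stg_sales VALUES
            (1, 'Ada', 120.0), (2, 'Ada', 80.0), (3, 'Grace', 50.0);

        -- Dimension: one row per customer, with a surrogate key.
        CREATE TABLE dim_customer AS
        SELECT ROW_NUMBER() OVER (ORDER BY customer) AS customer_key,
               customer AS name
        FROM (SELECT DISTINCT customer FROM stg_sales);

        -- Fact: one row per sale, keyed to the dimension.
        CREATE TABLE fact_sales AS
        SELECT s.order_id, d.customer_key, s.amount
        FROM stg_sales s JOIN dim_customer d ON d.name = s.customer;
    """)

    totals = con.execute("""
        SELECT d.name, SUM(f.amount)
        FROM fact_sales f JOIN dim_customer d USING (customer_key)
        GROUP BY d.name ORDER BY d.name
    """).fetchall()
    print(totals)  # [('Ada', 200.0), ('Grace', 50.0)]
    ```

    Splitting facts from dimensions keeps descriptive attributes in one place and lets the fact table stay narrow, which is the payoff of the star-schema modeling the project teaches.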

  7. Article
     Community Picks · 1y

    feast-dev/feast: The Open Source Feature Store for AI/ML

    Feast (Feature Store) is an open source feature store for machine learning that helps manage infrastructure to productionize analytic data for model training and online inference. It provides a unified platform for making features available for training and serving, avoiding data leakage, and decoupling ML from data infrastructure. Feast supports various data sources, offline and online stores, and includes capabilities for feature engineering, feature serving, and data quality management.
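    The train/serve split a feature store manages can be illustrated with a toy in-memory version: an offline store of timestamped history for training, and an online store of latest values for inference. The names below are illustrative stand-ins, not Feast's API.

    ```python
    from datetime import datetime

    # Offline store: historical feature rows with event timestamps (training).
    offline_store = [
        {"driver_id": 1, "ts": datetime(2025, 3, 1), "trips_today": 3},
        {"driver_id": 1, "ts": datetime(2025, 3, 2), "trips_today": 7},
    ]
    # Online store: latest value per entity (low-latency serving).
    online_store = {1: {"trips_today": 7}}

    def get_historical_features(entity_id, as_of):
        """Point-in-time lookup: rows after `as_of` never leak into training."""
        rows = [r for r in offline_store
                if r["driver_id"] == entity_id and r["ts"] <= as_of]
        return max(rows, key=lambda r: r["ts"])["trips_today"]

    def get_online_features(entity_id):
        """Key-value lookup at inference time."""
        return online_store[entity_id]["trips_today"]

    print(get_historical_features(1, datetime(2025, 3, 1)))  # 3, not 7
    print(get_online_features(1))                            # latest: 7
    ```

    Serving both paths from one set of feature definitions is how a feature store avoids the data leakage and train/serve skew the summary mentions.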

  8. Article
     Open Source · 1y

    CocoIndex - ETL to prepare fresh data for AI, like LEGO

    CocoIndex is an open-source ETL tool designed to prepare data for AI applications such as semantic search and retrieval-augmented generation. It features a data-driven programming model, custom transformation logic, and incremental updates. Built on a Rust core with a Python SDK, CocoIndex allows users to build indexing pipelines using a modular, Lego-like approach, ensuring data consistency and minimal re-computation.
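    Incremental updates of the kind CocoIndex performs usually boil down to fingerprinting inputs and recomputing only what changed. A minimal pure-Python sketch of that idea, with a trivial stand-in transform, not CocoIndex's Rust engine:

    ```python
    import hashlib

    def fingerprint(text):
        return hashlib.sha256(text.encode()).hexdigest()

    def transform(text):
        # Stand-in for a real step such as chunking + embedding for search.
        return text.upper()

    def incremental_index(docs, index, fingerprints):
        """Recompute only documents whose content changed since the last run."""
        recomputed = []
        for doc_id, text in docs.items():
            fp = fingerprint(text)
            if fingerprints.get(doc_id) != fp:   # skip unchanged inputs
                index[doc_id] = transform(text)
                fingerprints[doc_id] = fp
                recomputed.append(doc_id)
        return recomputed

    index, fps = {}, {}
    incremental_index({"a": "hello", "b": "world"}, index, fps)   # first run: both
    changed = incremental_index({"a": "hello", "b": "world!"}, index, fps)
    print(changed)  # only "b" is recomputed
    ```

    Tracking fingerprints per pipeline stage is what keeps re-computation minimal while guaranteeing the index stays consistent with the source data.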