Best of Data Engineering, April 2025

  1. Article
    SwirlAI · 1y

    The evolution of Modern RAG Architectures.

    The post delves into the evolution of Retrieval Augmented Generation (RAG) architectures, discussing their development from Naive RAG to advanced techniques, including Cache Augmented Generation (CAG) and Agentic RAG. It highlights the challenges addressed at each stage, advanced methods to improve accuracy, and the potential future advancements in RAG systems.
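The Naive RAG baseline the post starts from can be sketched in a few lines: embed the query and the documents, rank by similarity, and stuff the top hits into the prompt. This is a toy sketch with a bag-of-words "embedding" standing in for a real embedding model; the document texts and function names are illustrative, not from the post.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=2):
    """Naive RAG retrieval: rank every document by similarity to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query, docs):
    """Assemble the retrieved context and the question into a single prompt."""
    context = "\n".join(retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

    The advanced stages the post covers (CAG, Agentic RAG) layer caching and tool-using agents on top of this retrieve-then-generate loop.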

  2. Article
    ByteByteGo · 52w

    How Netflix Orchestrates Millions of Workflow Jobs with Maestro

    Netflix transitioned from using the Meson orchestrator to Maestro due to scalability issues with the growing volume of data and workflows. Maestro, built with a distributed microservices architecture, efficiently manages large-scale workflows with high reliability and low operational overhead. It supports dynamic workflows, defined via DSLs, a visual UI, or programmatic APIs, and leverages technologies such as CockroachDB and distributed queues. Features like event publishing, parameterized workflows, and an integrated signal service enable Maestro to handle extensive data processing and machine learning tasks at scale.

  3. Article
    ByteByteGo · 51w

    EP159: The Data Engineering Roadmap

    Data engineering is crucial for effective data analysis. Key components include learning SQL and programming languages, mastering various processing tools, databases, messaging platforms, data lakes, cloud computing platforms, storage systems, orchestration tools, automation, and frontend/dashboarding tools.

  4. Article
    Towards Dev · 51w

Building an End-to-End Data Lakehouse with Medallion Architecture, Airflow, and DuckDB

Learn how to build an end-to-end data lakehouse using Medallion architecture, Apache Airflow, and DuckDB. Understand the roles of the Bronze, Silver, and Gold layers in managing data quality and transformation. Discover why Apache Airflow is ideal for orchestrating workflows and how DuckDB serves as a high-performance analytical database for data warehousing.
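The Bronze/Silver/Gold progression can be sketched as three SQL steps. This sketch uses Python's built-in sqlite3 as a stand-in so it runs anywhere; with DuckDB (as in the article) the calls would be `duckdb.connect(...).execute(...)` with near-identical SQL, and in Airflow each layer would typically be its own task. Table names and sample rows are hypothetical.

```python
import sqlite3

# sqlite3 stands in for DuckDB here purely so the sketch is self-contained.
con = sqlite3.connect(":memory:")

# Bronze: raw events landed as-is, including duplicates and bad rows.
con.execute("CREATE TABLE bronze_orders (id INTEGER, amount REAL, status TEXT)")
con.executemany(
    "INSERT INTO bronze_orders VALUES (?, ?, ?)",
    [(1, 10.0, "paid"), (1, 10.0, "paid"), (2, -5.0, "error"), (3, 7.5, "paid")],
)

# Silver: deduplicated and cleaned, enforcing basic quality rules.
con.execute("""
    CREATE TABLE silver_orders AS
    SELECT DISTINCT id, amount, status
    FROM bronze_orders
    WHERE amount > 0 AND status != 'error'
""")

# Gold: business-level aggregate ready for dashboards.
con.execute("""
    CREATE TABLE gold_revenue AS
    SELECT COUNT(*) AS orders, SUM(amount) AS revenue FROM silver_orders
""")
```

    The point of the layering is that each table is reproducible from the one below it, so a bad transformation can be re-run without re-ingesting raw data.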

  5. Article
    GoPenAI · 51w

    RAG Database Patterns: Speed, Recall, and Structure

    The post provides a comprehensive overview of database patterns in Retrieval-Augmented Generation (RAG) systems. It emphasizes the importance of using multiple types of databases including vector databases, search databases, document stores, and graph databases to achieve optimal speed, recall, and structure. It also discusses design patterns for integrating these databases, highlighting their individual strengths and limitations, and offers real-world implementation guidance.
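One common integration pattern from this family is hybrid retrieval: blend a vector store's semantic score with a search database's keyword score so neither signal dominates. The sketch below is a toy version with set-overlap scores standing in for real ANN and BM25 backends; the function names and weighting are illustrative, not from the post.

```python
def keyword_score(query, doc):
    """Search-database-style signal: fraction of query terms present in the doc."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q)

def vector_score(query, doc):
    """Stand-in for a vector-database similarity; here a token-overlap Jaccard."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q | d)

def hybrid_rank(query, docs, alpha=0.5):
    """Blend both signals; alpha trades vector-style recall against keyword precision."""
    scored = [
        (alpha * vector_score(query, d) + (1 - alpha) * keyword_score(query, d), d)
        for d in docs
    ]
    return [d for _, d in sorted(scored, reverse=True)]
```

    In a production system each score would come from a dedicated store (e.g. an ANN index and a full-text engine), which is exactly the multi-database design the post argues for.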

  6. Article
    Data Engineering · 51w

    Data Engineering Vault: 1000+ Interconnected Concepts for Data Engineers

    The Data Engineering Vault is a curated collection of over 1,000 interconnected concepts designed to form a comprehensive knowledge base for data engineers. It includes detailed notes on the data engineering lifecycle, various data modeling approaches, modern data infrastructure, data transformation paradigms, analytics, and specialized techniques. The vault offers interconnected learning paths, historical context, practical applications, and recommendations for essential resources and thought leaders in the field.

  7. Article
    Tinybird · 51w

    dbt in real-time

    Tinybird offers an alternative to dbt for real-time analytics, simplifying the process of migrating API use cases from dbt. It provides built-in support for real-time processing, API endpoint creation, and simplifies the tech stack by consolidating all data operations. Tinybird uses ClickHouse for faster performance, especially for API responses. Migrating involves mapping dbt concepts to Tinybird equivalents, such as materialized views for incremental updates, and creating optimized data source schemas.

  8. Article
    Data Engineer Things · 1y

    Setting Up Airflow with a Custom Docker Image

    Learn how to create a custom Docker image for Apache Airflow, preloaded with dependencies, DAGs, and specific configurations. A step-by-step guide highlights the benefits of customization, such as adding Python libraries, environment-specific setups, and simplified CI/CD processes. Validate the setup to ensure your tailored Airflow environment functions perfectly.
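The customization pattern described follows the standard way of extending the official Airflow image. A minimal sketch, assuming an Airflow 2.x base tag and hypothetical `requirements.txt`, `dags/`, and `airflow.cfg` files in the build context:

```dockerfile
# Base tag is an assumption; pin whatever version you run in production.
FROM apache/airflow:2.9.2

# Install extra Python dependencies on top of the base image.
COPY requirements.txt /requirements.txt
RUN pip install --no-cache-dir -r /requirements.txt

# Bake DAGs and config into the image so every environment ships identically.
COPY dags/ /opt/airflow/dags/
COPY airflow.cfg /opt/airflow/airflow.cfg
```

    Build with `docker build -t my-airflow .` and point your compose file or Helm values at the new tag instead of the stock image.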

  9. Article
    Data Engineer Things · 1y

    Connecting Airflow to MongoDB: A Complete Guide

This guide covers how to set up and use MongoDB in Apache Airflow, detailing the installation of necessary dependencies, configuring MongoDB connections via the UI or CLI, and integrating MongoDB with Airflow DAGs using MongoHook and MongoOperator. It also provides sample code for fetching data from MongoDB and testing the connection.
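The MongoHook pattern the guide describes looks roughly like this: a task body opens the hook's client and runs a query built from task parameters. This is a sketch, not the guide's code; it assumes the `apache-airflow-providers-mongo` package is installed, a connection named `mongo_default` exists, and the `shop`/`orders` database and collection names are hypothetical.

```python
from datetime import datetime

def orders_since(cutoff):
    """Build a MongoDB filter for documents created on or after `cutoff`."""
    return {"created_at": {"$gte": cutoff}}

def fetch_orders(**context):
    """Hypothetical Airflow task body using MongoHook to pull recent orders."""
    # Imported inside the task so the DAG file parses even where the
    # provider package is absent.
    from airflow.providers.mongo.hooks.mongo import MongoHook

    hook = MongoHook(mongo_conn_id="mongo_default")
    client = hook.get_conn()
    coll = client["shop"]["orders"]  # hypothetical database/collection
    return list(coll.find(orders_since(datetime(2024, 1, 1))))
```

    Wrapped in a `PythonOperator` (or `@task`), `fetch_orders` becomes an ordinary DAG step whose credentials live in the Airflow connection, not in code.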

  10. Video
    YouTube · 1y

    SQL Full Course for Beginners (30 Hours) – From Zero to Hero

The course, led by Baraa Khatib Salkini, covers SQL from the basics to advanced techniques including window functions, stored procedures, and database optimization. Suitable for data engineers, analysts, scientists, and students, it offers extensive materials and is entirely free. The training includes step-by-step instructions, animated visuals for complex concepts, and practical projects such as data warehousing and analytics.