Best of ETL 2025

  1. Article
    Medium · 1y

    Building a modern Data Warehouse from scratch

    Learn how to build a modern data warehouse using SQL Server. The project guides you through designing the data architecture with the Medallion Architecture, setting up ETL pipelines, developing data models, and creating analytics and reporting solutions. Key steps include setting up project tools, implementing data quality checks, and building the bronze, silver, and gold layers that organize data processing. Resources and detailed instructions are provided for each phase.

  2. Article
    Towards Dev · 42w

    Building a Scalable Real-Time ETL Pipeline with Kafka, Debezium, Flink, Airflow, MinIO, and ClickHouse

    A comprehensive guide to building a scalable real-time ETL pipeline using open-source tools including Kafka for data streaming, Debezium for change data capture, Flink for stream processing, ClickHouse as a lakehouse solution, Airflow for orchestration, and MinIO for object storage. The architecture separates hot and cold data layers, with real-time data stored locally for performance and historical data in remote storage for cost optimization. Includes practical implementation steps, Docker configurations, and dashboard creation using Apache Superset.

  3. Article
    Data Engineer Things · 1y

    I spent 6 hours learning AWS Glue. Here is what I found

    AWS Glue is a serverless data integration service that simplifies and automates the ETL process, enabling users to integrate data from various sources, preprocess and transform it, and make it available for analytics. It seamlessly integrates with AWS services like S3, Redshift, and Athena and supports cost-effective and scalable data processing. Key components include Glue Studio, Glue ETL Library with DynamicFrames, and serverless execution with auto-scaling. The Glue Data Catalog acts as a central repository for metadata, facilitating efficient data discovery and management.

  4. Video
    YouTube · 1y

    SQL Data Warehouse from Scratch | Full Hands-On Data Engineering Project

    Learn how to build a modern SQL data warehouse from scratch, incorporating real-world practices used in companies like Mercedes-Benz. The project covers data architecture design, ETL processes, and data modeling basics. By the end, you'll have a professional portfolio project to showcase your skills.

  5. Article
    Data Engineer Things · 46w

    How I Built a Reddit Data Pipeline

    A comprehensive guide to building an end-to-end data pipeline that extracts Reddit data, transforms it using AWS Glue, and stores it in S3 for querying with Athena and Redshift Spectrum. The tutorial covers environment setup with Docker and Airflow, infrastructure provisioning using Terraform, and implementing ETL workflows with proper orchestration. Key components include Reddit API integration, AWS services configuration (S3, Glue, Athena, Redshift), and DAG development for automated data processing.

  6. Article
    Tinybird · 1y

    dbt in real-time

    Tinybird offers an alternative to dbt for real-time analytics and eases the migration of API use cases from dbt. It provides built-in support for real-time processing and API endpoint creation, and reduces the tech stack by consolidating all data operations. Tinybird uses ClickHouse for faster performance, especially for API responses. Migrating involves mapping dbt concepts to Tinybird equivalents, such as materialized views for incremental updates, and creating optimized data source schemas.

  7. Article
    Open Source · 1y

    Pyper - Concurrent Python Made Simple

    Pyper is a flexible, pure-Python framework designed for concurrent and parallel data processing. It features an intuitive API that unifies threaded, multiprocessed, and asynchronous work using functional programming principles. Pyper ensures safety by managing underlying task execution and resource clean-up, and it is optimized for efficiency with lazy execution through queues, workers, and generators.

  8. Article
    Tiger's Place · 1y

    Data Loading Patterns (data integration)

    Discusses various data loading patterns for data integration, including full snapshot load, incremental load, delta load, and real-time updates. It explains the implementation techniques, key challenges, and use cases for each method, highlighting how they address different efficiency, history tracking, and immediacy requirements.

  9. Article
    Javarevisited · 46w

    Top 8 Udemy Courses to Learn Apache Airflow in 2025

    A curated list of 8 Udemy courses for learning Apache Airflow in 2025, ranging from beginner to advanced levels. The courses cover workflow orchestration, DAG creation, cloud deployment, and production-level implementations. Recommendations include Marc Lamberti's hands-on introduction for beginners and advanced courses covering AWS, Docker, and Kubernetes integration for experienced users.

  10. Article
    Towards Dev · 44w

    Industry-Standard Architecture for Data Engineering Projects

    A comprehensive guide to building scalable data engineering architecture using Azure Data Factory and Databricks. The approach involves extracting CSV files from SharePoint, processing them through bronze and silver data layers, and implementing control tables for pipeline management. Key components include parameterized ADF pipelines, progress tracking metadata tables, and automated error handling to support multiple data interfaces efficiently.

  11. Article
    databricks · 50w

    Announcing Lakeflow Designer: No-Code ETL, Powered by the Databricks Intelligence Platform

    Databricks introduces Lakeflow Designer, an AI-powered no-code pipeline builder that enables business analysts to create production-ready ETL pipelines without coding. The tool generates standard Lakeflow Declarative Pipelines that data engineers can review and modify, eliminating the typical separation between business and technical teams. Designer leverages AI grounded in organizational data context and provides built-in governance, observability, and scalability within the unified Databricks platform.

  12. Article
    Programming Digest · 47w

    Which Data Architecture Should I Choose for My Workplace? — A Data Engineer’s Approach

    A comprehensive guide comparing four major data architecture approaches: Data Warehouse, Data Lake, Data Lakehouse, and Data Mesh. The article explains when to use each approach, their advantages and challenges, and provides platform recommendations. It focuses on the Medallion Architecture with its Bronze, Silver, and Gold layers for modern data warehouse design, emphasizing the importance of requirement analysis and proper architectural selection based on data types, analytical needs, and organizational structure.

  13. Article
    Data Engineer Things · 49w

    Stream Kafka Topic to the Iceberg Tables with Zero-ETL

    AutoMQ introduces Table Topic, an open-source feature that automatically converts Kafka topic messages to Iceberg tables without requiring separate ETL pipelines. The solution addresses the complexity of managing Kafka-to-lakehouse data flows by handling schema management, partitioning, and upsert operations automatically. This represents an evolution from Kafka's original shared-nothing architecture to a shared-data approach, where data is accessible through both Kafka APIs and as Iceberg tables for analytics workloads.

  14. Article
    Data Engineer Things · 1y

    Building ETL pipeline using Google Cloud Storage

    The post provides a guide on creating a simple ETL pipeline using Google Cloud Storage to process Zomato restaurant data from Kaggle. It involves extracting, transforming, and loading the data using Python and Google Cloud Storage, offering insights suitable for beginners in data engineering. Suggested improvements include automation, extension to other cloud services, dashboarding, and data validation.
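
Several entries above mention loading patterns such as full snapshot and incremental loads. As a rough illustration only, and not taken from any of the linked articles, the difference can be sketched with Python's built-in sqlite3 module. The table name `restaurants`, its columns, and the helper names are all hypothetical, and the integer `updated_at` column is an assumed stand-in for a real change timestamp.

```python
import sqlite3

# Hypothetical schema used only for this sketch.
SCHEMA = (
    "CREATE TABLE restaurants ("
    "id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)


def make_db():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def full_snapshot_load(src, dst):
    """Replace the whole target table with a fresh copy of the source."""
    rows = src.execute(
        "SELECT id, name, updated_at FROM restaurants"
    ).fetchall()
    dst.execute("DELETE FROM restaurants")
    dst.executemany("INSERT INTO restaurants VALUES (?, ?, ?)", rows)
    dst.commit()
    return len(rows)


def incremental_load(src, dst):
    """Copy only rows newer than the target's high-water mark."""
    (watermark,) = dst.execute(
        "SELECT COALESCE(MAX(updated_at), 0) FROM restaurants"
    ).fetchone()
    rows = src.execute(
        "SELECT id, name, updated_at FROM restaurants WHERE updated_at > ?",
        (watermark,),
    ).fetchall()
    # INSERT OR REPLACE overwrites existing rows that share a primary key.
    dst.executemany(
        "INSERT OR REPLACE INTO restaurants VALUES (?, ?, ?)", rows
    )
    dst.commit()
    return len(rows)


if __name__ == "__main__":
    source, target = make_db(), make_db()
    source.executemany(
        "INSERT INTO restaurants VALUES (?, ?, ?)",
        [(1, "A", 1), (2, "B", 2)],
    )
    print("snapshot rows:", full_snapshot_load(source, target))
    source.execute(
        "UPDATE restaurants SET name = 'A2', updated_at = 3 WHERE id = 1"
    )
    source.execute("INSERT INTO restaurants VALUES (3, 'C', 4)")
    print("incremental rows:", incremental_load(source, target))
```

The full snapshot is simple but rereads everything on each run. The watermark approach moves less data, but in this simplified form it does not notice rows deleted at the source, and it would miss a change that reuses a timestamp equal to the current watermark.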