Best of ETL 2024

  1.
    Article
    Hacker News · 2y

    Python-based ETL

    Amphi is a Python-based ETL tool designed for efficient data extraction, transformation, and loading with a low-code approach. It features a graphical user interface for designing data pipelines, generating deployable native Python code, and supporting various data formats like CSV and JSON. Amphi ensures flexibility, ease of sharing pipeline definitions, and guarantees data privacy as processing is done locally. The platform is aimed at fostering community collaboration among data practitioners of all levels.

  2.
    Article
    KDnuggets · 2y

    10 Built-In Python Modules Every Data Engineer Should Know

    Python's standard library includes built-in modules that are essential for data engineering tasks. Key modules like os, pathlib, shutil, csv, json, pickle, sqlite3, datetime, re, and subprocess enable efficient file and directory management, data handling and serialization, database interaction, text processing, and more. Utilizing these modules can streamline your data engineering workflows, providing essential functionality without relying on external libraries.
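As a rough illustration of the summary above, here is a minimal sketch of a small extract-transform-load run that uses only standard-library modules (csv, json, re, sqlite3, datetime). The sample data, the order ID pattern, and the table name are assumptions made for the example and do not come from the article.

```python
import csv
import io
import json
import re
import sqlite3
from datetime import datetime

# Hypothetical raw input; in practice this would be read from a file on disk.
RAW_CSV = """order_id,ordered_at,amount
A-001,2024-07-01,19.99
A-002,2024-07-02,5.00
bad-row,2024-07-03,12.50
"""

ORDER_ID = re.compile(r"^A-\d{3}$")  # assumed ID format, for illustration only


def extract(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Drop rows with malformed IDs and convert string fields to typed values."""
    cleaned = []
    for row in rows:
        if not ORDER_ID.match(row["order_id"]):
            continue
        ordered_at = datetime.strptime(row["ordered_at"], "%Y-%m-%d").date()
        cleaned.append(
            {
                "order_id": row["order_id"],
                "ordered_at": ordered_at.isoformat(),
                "amount": float(row["amount"]),
            }
        )
    return cleaned


def load(rows, conn):
    """Write cleaned rows into a SQLite table, replacing rows with the same ID."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id TEXT PRIMARY KEY, ordered_at TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO orders VALUES (:order_id, :ordered_at, :amount)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
cleaned_rows = transform(extract(RAW_CSV))
load(cleaned_rows, conn)
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(json.dumps(cleaned_rows, indent=2))
```

Keeping extract, transform, and load as separate functions makes each step easy to test on its own, and the in-memory SQLite database can be swapped for a file path without changing the rest of the code.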

  3.
    Article
    Towards Dev · 2y

    Building a Serverless Data Pipeline: A Step-by-Step Guide

    The guide provides step-by-step instructions to build a serverless data pipeline using AWS services. Key components include AWS Lambda for data extraction from the Colombo Stock Market Index API, Amazon Kinesis Data Firehose for data ingestion, Amazon S3 for storage, and AWS Glue for ETL orchestration with Athena for querying data. The pipeline uses event-driven architectures with SQS notifications and Glue crawlers for efficient data processing.

  4.
    Article
    Hacker News · 2y

    The Great Database Migration

    Shepherd successfully migrated its pricing engine database from SQLite to Postgres with zero downtime. The new architecture improves scalability, performance, and developer experience. The migration included converting synchronous functions to asynchronous, leveraging a serverless architecture with Neon, and automating ETL processes. The project highlighted performance optimizations, including caching strategies and connection pooling, resulting in significantly improved response times.

  5.
    Article
    Machine Learning News · 2y

    Top Data Engineering Courses in 2024

    Data engineering is crucial for organizations relying on data-driven insights. This post lists top courses for mastering data engineering skills such as building scalable data solutions, ETL processes, and leveraging technologies like Apache Spark and cloud platforms. Courses include IBM’s Data Engineering Foundations, Meta Database Engineer Professional Certificate, and Google Cloud Database Engineer Specialization, among others.

  6.
    Article
    Data Engineer Things · 1y

    ETL and ELT

    The author reflects on their journey from chasing the latest data engineering tools to focusing on foundational concepts, emphasizing the shift from ETL (Extract, Transform, Load) to ELT (Extract, Load, Transform). The traditional ETL process, necessitated by the high costs and limitations of early data warehouses, is contrasted with the modern ELT approach, facilitated by advancements in cloud data warehousing. ELT offers greater flexibility and efficiency by loading raw data into the warehouse and handling transformations within the warehouse, aligning better with agile development practices.
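The ELT ordering described above can be sketched in a few lines. This is only an illustration: an in-memory SQLite database stands in for a cloud warehouse, and the table names and sample events are hypothetical. The point is that raw data is loaded untouched, and the transformation runs later as SQL inside the warehouse.

```python
import sqlite3

# Raw events as they might arrive from a source system (hypothetical data).
raw_events = [
    ("u1", "purchase", "12.50"),
    ("u2", "purchase", "7.25"),
    ("u1", "refund", "12.50"),
    ("u1", "purchase", "30.00"),
]

warehouse = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse

# Extract + Load: land the data as-is, amounts still stored as text.
warehouse.execute("CREATE TABLE raw_events (user_id TEXT, kind TEXT, amount TEXT)")
warehouse.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", raw_events)

# Transform: done afterwards, in SQL, inside the warehouse.
warehouse.execute(
    """
    CREATE TABLE user_revenue AS
    SELECT user_id,
           SUM(CASE kind WHEN 'purchase' THEN CAST(amount AS REAL)
                         WHEN 'refund'   THEN -CAST(amount AS REAL)
                         ELSE 0 END) AS net_revenue
    FROM raw_events
    GROUP BY user_id
    """
)

user_revenue = dict(warehouse.execute("SELECT user_id, net_revenue FROM user_revenue"))
print(user_revenue)
```

Because the raw table is kept, the transformation can be rewritten and rerun later without going back to the source system, which is the flexibility the ELT approach is credited with.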

  7.
    Article
    Hacker News · 2y

    Stripe Data vs Open‐Source Alternatives: a MRR example

    Stripe's API lacks a straightforward way to calculate MRR, which pushes users toward additional paid tools such as Stripe Sigma and Stripe Data Pipeline. These tools suit large companies with substantial transaction volumes, but their cost makes them impractical for businesses with smaller volumes. Open-source alternatives, such as Lago, provide more flexibility and control over financial data, avoiding dependence on expensive third-party solutions.
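For readers unfamiliar with the metric, MRR (monthly recurring revenue) is commonly computed by normalizing every active subscription to a monthly amount and summing. The sketch below shows that arithmetic only; the record fields are hypothetical and are neither Stripe's nor Lago's schema, and real calculations usually also have to handle discounts, trials, and proration.

```python
from decimal import Decimal

# Hypothetical subscription records; field names are illustrative only.
subscriptions = [
    {"customer": "acme", "amount": Decimal("1200.00"), "interval": "year", "status": "active"},
    {"customer": "globex", "amount": Decimal("49.00"), "interval": "month", "status": "active"},
    {"customer": "initech", "amount": Decimal("99.00"), "interval": "month", "status": "canceled"},
]

MONTHS_PER_INTERVAL = {"month": 1, "year": 12}


def monthly_recurring_revenue(subs):
    """Sum active subscriptions, normalizing each one to a monthly amount."""
    total = Decimal("0")
    for sub in subs:
        if sub["status"] != "active":
            continue  # canceled subscriptions no longer recur
        total += sub["amount"] / MONTHS_PER_INTERVAL[sub["interval"]]
    return total


mrr = monthly_recurring_revenue(subscriptions)
print(mrr)
```

Decimal is used instead of float so that currency amounts are not subject to binary rounding error.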

  8.
    Article
    Substack · 2y

    Data pipelines and SCDs

    Designing backfillable data pipelines using idempotent transformation code avoids the complications of ad-hoc SQL. When handling Slowly Changing Dimensions (SCDs), SCD Type 2 is preferred for its immutability and compressive qualities, though it involves complex surrogate key lookups. Alternatively, snapshot tables offer a simpler, reproducible model at the cost of higher data replication, making them ideal in cloud environments where storage is cheaper than engineering time.
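One common way to make a load step idempotent, and therefore safe to backfill, is to rebuild a whole date partition inside a single transaction. The sketch below assumes that delete-then-insert pattern and uses hypothetical table names with an in-memory SQLite database; the post itself may use a different mechanism.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_orders (order_date TEXT, amount REAL)")
conn.execute("CREATE TABLE daily_sales (order_date TEXT PRIMARY KEY, total REAL)")
conn.executemany(
    "INSERT INTO source_orders VALUES (?, ?)",
    [("2024-07-01", 10.0), ("2024-07-01", 5.0), ("2024-07-02", 8.0)],
)


def load_partition(conn, run_date):
    """Rebuild exactly one date partition; safe to rerun for backfills."""
    with conn:  # one transaction: the delete and the insert commit together
        conn.execute("DELETE FROM daily_sales WHERE order_date = ?", (run_date,))
        conn.execute(
            """
            INSERT INTO daily_sales
            SELECT order_date, SUM(amount)
            FROM source_orders
            WHERE order_date = ?
            GROUP BY order_date
            """,
            (run_date,),
        )


# Running the same date twice leaves the table in the same state.
for run_date in ["2024-07-01", "2024-07-02", "2024-07-01"]:
    load_partition(conn, run_date)

daily_sales = conn.execute(
    "SELECT order_date, total FROM daily_sales ORDER BY order_date"
).fetchall()
print(daily_sales)
```

Because the function's output depends only on the source data and the run date, a backfill is just a loop over the dates that need to be recomputed.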

  9.
    Article
    Towards Data Science · 2y

    Data Modeling Techniques For Data Warehouse

    Data modeling is a key process in creating conceptual representations of organizational data and its relationships. Focusing on various methodologies like Kimball's, Inmon's, and Data Vault, this guide provides insights into dimensional modeling, including benefits like simplicity, improved query performance, and scalability. It also covers different schema types (star and snowflake), and strategies for data loading. Special attention is given to innovative approaches like using one big table (OBT) for modern data warehouses.

  10.
    Article
    Golang Weekly · 2y

    Golang Weekly Issue 514: July 9, 2024

    Learn how to locally patch dependencies in Go using `go mod` and discover tools for building scalable Go applications on AWS. This issue also covers the release of Ergo 2.14, a modern IRC server, and highlights CoreDNS for DNS and service discovery, as well as OmniParser for ETL tasks. Additional updates include releases like gocron 2.8, Go Micro 5.3, River 0.9, and go-arg 1.5.

  11.
    Article
    MotherDuck · 2y

    Performant dbt pipelines with MotherDuck

    This post recaps learnings from the dbt+MotherDuck workshop and delves into building performant data pipelines using DuckDB and MotherDuck. Key steps include utilizing the read_blob() function, leveraging pre_hooks and variables in DuckDB, implementing incremental models with read_csv(), and handling data de-duplication using unnest() and arg_max(). These techniques aim to optimize data workflows and enhance data transformation and analysis efficiency.
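The de-duplication step mentioned above keeps one row per key, namely the row with the greatest value in an ordering column, which is what grouping by the key and applying arg_max() achieves in DuckDB. The sketch below shows that logic in plain Python rather than DuckDB SQL; the record fields and the helper name `latest_per_key` are hypothetical.

```python
# Hypothetical change records: several versions of the same id may appear.
records = [
    {"id": 1, "status": "new", "updated_at": "2024-07-01T09:00:00"},
    {"id": 2, "status": "new", "updated_at": "2024-07-01T10:00:00"},
    {"id": 1, "status": "shipped", "updated_at": "2024-07-02T08:30:00"},
    {"id": 1, "status": "packed", "updated_at": "2024-07-01T18:00:00"},
]


def latest_per_key(rows, key, order_by):
    """Keep, for each key, the row with the greatest order_by value."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[order_by] > current[order_by]:
            latest[row[key]] = row
    # Sort by key so the output order does not depend on input order.
    return sorted(latest.values(), key=lambda r: r[key])


deduped = latest_per_key(records, key="id", order_by="updated_at")
for row in deduped:
    print(row)
```

ISO 8601 timestamps in the same format and time zone compare correctly as strings, which is why no date parsing is needed here.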

  12.
    Article
    Materialized View · 2y

    It's Time to Merge Analytics and Data Engineering (Again)

    The post argues for merging analytics and data engineering roles, citing the commoditization of data pipelines and the limited value provided by distinct analytics engineers. With advancements like LLMs, data integration tools, and data pipeline vendors, there's a push for a consolidated data team handling extraction, transformation, and loading (ETL) processes. The author notes emerging tools that facilitate this transition and predicts a convergence of these roles in the coming years.

  13.
    Article
    Data Engineer Things · 2y

    End-to-End ETL and Sales Dashboard Project in Microsoft Fabric

    A step-by-step guide to creating a sales dashboard using Microsoft Fabric and Power BI Desktop for the WideWorldImporters sample database. Key goals include creating a dynamic and user-friendly interface for monitoring sales performance across various dimensions like customer, product, and region. The post covers data gathering, ETL processes, creating views and tables, setting up a semantic model, and building various visuals to support different user stories.

  14.
    Article
    Data Science Central · 2y

    Reverse ETL in Healthcare

    Managing patient data is a significant challenge in healthcare. Reverse ETL is a data integration method that ensures the smooth flow of data from data warehouses to operational systems like CRMs and ERPs. This real-time data synchronization improves patient care, enhances decision-making, maintains data consistency, ensures regulatory compliance, and enhances operational efficiency. Key components for successful implementation include a centralized data warehouse, robust ETL tools, seamless integration with operational systems, stringent data governance measures, and proper training for healthcare staff.

  15.
    Video
    YouTube · 2y

    Fundamentals Of Data Engineering Masterclass

    This Data Engineering masterclass covers the fundamentals of Data Engineering, including the life cycle, data generation, storage, database management, data modeling, and the distinction between SQL and NoSQL. It delves into data processing systems like OLTP and OLAP, ETL processes, and building data architecture from scratch. The session also explores data warehousing, dimensional modeling, data marts, data lakes, big data, cloud services (AWS, GCP, Azure), and key tools for data engineering such as Python, SQL, Apache Spark, Databricks, Apache Airflow, and Apache Kafka. Real-world architecture case studies on AWS and GCP are discussed as well.

  16.
    Article
    MongoDB_Official · 1y

    AWS Glue Visual ETL for Your Data in MongoDB Atlas

    Learn how to use AWS Glue's visual ETL capabilities to transfer data between MongoDB Atlas and AWS S3. AWS Glue Studio allows developers to create ETL pipelines without needing knowledge of Spark or SQL, facilitating seamless data transformation and integration with other AWS services. AWS S3 is utilized for scalable, durable, and cost-effective data storage, making it suitable for data lakes, warehousing, machine learning, media streaming, backup, and web hosting.

  17.
    Article
    databricks · 2y

    Accelerate Feature Engineering With Photon

    Training high-quality machine learning models involves careful data preparation, which can be time-consuming for large datasets. The Photon Engine, now available in Databricks Machine Learning Runtime, significantly speeds up Spark SQL and Spark DataFrame workloads, achieving speed improvements of 2x-4x. The Photon Engine enhances ETL processes and feature engineering, especially for time series data, using a new point-in-time join implementation. Users can enable Photon in Databricks ML Runtime 15.2 and above for better query performance.
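A point-in-time join attaches to each training event the most recent feature value that was known at or before the event's timestamp, which avoids leaking future data into the features. The sketch below illustrates that semantics in plain Python with a binary search. It is not Photon's or Spark's implementation, and the data and the function name `point_in_time_join` are hypothetical.

```python
from bisect import bisect_right

# Hypothetical feature history per user: (timestamp, value), sorted by timestamp.
feature_history = {
    "u1": [(1, 0.10), (5, 0.40), (9, 0.90)],
    "u2": [(3, 0.25)],
}

# Label events to enrich: (user, event timestamp).
label_events = [("u1", 4), ("u1", 9), ("u2", 2), ("u2", 7)]


def point_in_time_join(events, history):
    """Attach the latest feature value observed at or before each event."""
    joined = []
    for user, ts in events:
        rows = history.get(user, [])
        timestamps = [t for t, _ in rows]
        idx = bisect_right(timestamps, ts)  # number of rows with timestamp <= ts
        value = rows[idx - 1][1] if idx > 0 else None
        joined.append((user, ts, value))
    return joined


training_rows = point_in_time_join(label_events, feature_history)
for row in training_rows:
    print(row)
```

An event that precedes every recorded feature value gets None rather than a later value, since using a later value would be exactly the leakage this kind of join is meant to prevent.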

  18.
    Article
    Towards Data Science · 2y

    How to Pivot Tables in SQL

    A comprehensive guide to creating pivot tables in SQL for enhanced data analysis, using the DECODE() function and the PIVOT clause.
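DECODE() and PIVOT are not available in every SQL dialect, so the sketch below uses a portable substitute: conditional aggregation with CASE expressions, which produces the same rows-to-columns result. It runs against an in-memory SQLite database through Python's sqlite3 module, and the table and its data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("north", "Q1", 100.0),
        ("north", "Q2", 150.0),
        ("south", "Q1", 80.0),
        ("south", "Q2", 120.0),
        ("north", "Q1", 25.0),
    ],
)

# One output row per region, one output column per quarter.
pivoted = conn.execute(
    """
    SELECT region,
           SUM(CASE WHEN quarter = 'Q1' THEN amount ELSE 0 END) AS q1,
           SUM(CASE WHEN quarter = 'Q2' THEN amount ELSE 0 END) AS q2
    FROM sales
    GROUP BY region
    ORDER BY region
    """
).fetchall()

for row in pivoted:
    print(row)
```

The trade-off is that the pivoted columns must be listed by hand, so a new quarter value means adding another CASE expression to the query.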