Best of Data Engineering: September 2024

  1. Article · KDnuggets · 2y

    10 Built-In Python Modules Every Data Engineer Should Know

    Python's standard library includes built-in modules that are essential for data engineering tasks. Key modules like os, pathlib, shutil, csv, json, pickle, sqlite3, datetime, re, and subprocess enable efficient file and directory management, data handling and serialization, database interaction, text processing, and more. Utilizing these modules can streamline your data engineering workflows, providing essential functionality without relying on external libraries.
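To make the list concrete, here is a minimal sketch touring a few of the modules named above (pathlib, json, csv, re); the file names and record fields are invented for the example.

```python
# A minimal tour of some of the modules listed above: pathlib for paths,
# json for serialization, csv for tabular files, re for text processing.
import csv
import json
import re
import tempfile
from pathlib import Path

# pathlib: build paths and create directories without string concatenation.
workdir = Path(tempfile.mkdtemp()) / "staging"
workdir.mkdir(parents=True, exist_ok=True)

# json: serialize a record to disk and read it back.
record = {"id": 1, "event": "signup", "ts": "2024-09-01T12:00:00"}
json_path = workdir / "record.json"
json_path.write_text(json.dumps(record))
loaded = json.loads(json_path.read_text())

# csv: write rows with a header, then read them back as dicts.
csv_path = workdir / "events.csv"
with csv_path.open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "event"])
    writer.writeheader()
    writer.writerow({"id": 1, "event": "signup"})
with csv_path.open(newline="") as f:
    rows = list(csv.DictReader(f))

# re: extract the date portion of the timestamp.
date = re.match(r"\d{4}-\d{2}-\d{2}", record["ts"]).group()
```

Note that `csv.DictReader` yields all values as strings; type conversion is left to the caller.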

  2. Article · Community Picks · 2y

    Top 5 Data Engineering Projects for Beginners 2024

    Get hands-on experience with five beginner-friendly data engineering projects that enhance skills in data processing, analytics, and visualization using tools like Apache Kafka, Python, and cloud platforms. These projects prepare you for real-world challenges and make your resume stand out.

  3. Article · Hacker News · 2y

    Data Engineering Vault

    The Data Engineering Vault is a comprehensive resource designed to help you explore and discover interconnected terms in data engineering. It covers the definition and evolution of data engineering, highlighting the importance of tools like Python, Apache Airflow, and SQL. Additionally, it offers resources for getting started with data engineering, including must-read articles, influential books, and valuable community insights.

  4. Article · Towards Data Science · 2y

    The “Who Does What” Guide To Enterprise Data Quality

    Effective data quality management in large organizations involves clearly defined roles and responsibilities across foundational and derived data products. Foundational products, managed by a central team, serve multiple use cases, while derived products are tailored for specific needs and owned by domain-specific teams. Key practices include end-to-end monitoring, business rule application, and efficient triage processes. Building trust through communication and data health measurement is also crucial.
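The "business rule application" and "triage" practices above can be sketched in a few lines; the rule names, fields, and thresholds below are illustrative, not from the article.

```python
# Illustrative sketch: apply simple business rules to records and triage
# failures. Rule names and thresholds are invented for the example.
from typing import Callable

Rule = tuple[str, Callable[[dict], bool]]

RULES: list[Rule] = [
    ("amount_non_negative", lambda r: r["amount"] >= 0),
    ("currency_known", lambda r: r["currency"] in {"USD", "EUR", "GBP"}),
]

def triage(records: list[dict]) -> dict[str, list[dict]]:
    """Split records into passing and failing, tagging each failure
    with the names of the rules it broke."""
    result = {"pass": [], "fail": []}
    for rec in records:
        broken = [name for name, check in RULES if not check(rec)]
        if broken:
            result["fail"].append({**rec, "broken_rules": broken})
        else:
            result["pass"].append(rec)
    return result
```

Tagging failures with the specific broken rules is what makes the downstream triage efficient: the owning team sees why a record failed, not just that it did.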

  5. Article · Data Engineer Things · 2y

    I spent 8 hours diving deep into Snowflake (again)

Snowflake, a prominent cloud data warehouse solution, was revisited in 2024 to re-examine its architecture and internal workings. The platform, known for separating compute and storage, relies on cloud object stores such as Amazon S3, Google Cloud Storage, and Azure Blob Storage for storage, and uses a shared-nothing engine for compute. Snowflake's system includes Virtual Warehouses, columnar storage, vectorized execution, and various caching mechanisms. It also uses FoundationDB for its data catalog management and employs runtime adaptivity in its query optimizer.
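Snowflake's engine is proprietary, but the core idea of columnar storage plus vectorized execution can be illustrated in a toy pure-Python sketch: store one list per column and have each operator consume a whole column at a time instead of one row at a time.

```python
# Toy illustration of columnar storage and vectorized execution.
# A "table" is one list per column; operators work on whole columns
# (vectors) per call rather than iterating row by row.
table = {
    "region": ["eu", "us", "eu", "us"],
    "revenue": [100, 250, 75, 300],
}

def vectorized_filter(table, column, predicate):
    """Return the row positions where predicate holds, scanning one column."""
    return [i for i, v in enumerate(table[column]) if predicate(v)]

def vectorized_sum(table, column, positions):
    """Sum a column at the selected positions."""
    col = table[column]
    return sum(col[i] for i in positions)

# Equivalent of: SELECT SUM(revenue) WHERE region = 'eu'
hits = vectorized_filter(table, "region", lambda r: r == "eu")
total = vectorized_sum(table, "revenue", hits)
```

The payoff in a real engine is that only the columns a query touches are read, and each column scan is a tight loop over homogeneous data.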

  6. Video · The Serious CTO · 2y

    Data Mesh: The Future of Data Engineering Explained

    Data Mesh redefines data architecture by decentralizing data management. Instead of centralizing all data in one big system, each department manages its own data, ensuring it's clean and accessible. This approach aims to eliminate bottlenecks, improve data quality, and foster better collaboration with shared standards across the company.

  7. Video · YouTube · 2y

    Fundamentals Of Data Engineering Masterclass

This masterclass covers the fundamentals of data engineering: the life cycle, data generation, storage, database management, data modeling, and the distinction between SQL and NoSQL. It delves into data processing systems like OLTP and OLAP, ETL processes, and building data architecture from scratch. The session also explores data warehousing, dimensional modeling, data marts, data lakes, big data, cloud services (AWS, GCP, Azure), and key tools for data engineering such as Python, SQL, Apache Spark, Databricks, Apache Airflow, and Apache Kafka. Real-world architecture case studies on AWS and GCP are discussed as well.
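The ETL pattern covered in the masterclass can be sketched end to end in a few lines; the source rows and field names here are invented, and a real pipeline would read from an external system rather than an in-memory list.

```python
# Minimal ETL sketch: extract raw rows, transform (clean + derive a
# field), and load into an in-memory "warehouse" table. All data and
# names are illustrative.
RAW = [
    {"user": " Alice ", "amount": "12.50"},
    {"user": "Bob", "amount": "7.00"},
]

def extract():
    # In practice this would read from a source system (API, file, DB).
    return list(RAW)

def transform(rows):
    # Normalize strings and convert currency to integer cents.
    return [
        {
            "user": row["user"].strip().lower(),
            "amount_cents": int(float(row["amount"]) * 100),
        }
        for row in rows
    ]

def load(rows, warehouse):
    warehouse.extend(rows)
    return len(rows)

warehouse = []
loaded = load(transform(extract()), warehouse)
```

Keeping extract, transform, and load as separate functions mirrors how orchestrators like Airflow schedule them as separate tasks.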

  8. Article · Confluent Blog · 2y

    Inside the Kafka Black Box—How Producers Prepare Event Data for Brokers

    Apache Kafka is a robust distributed event streaming platform ideal for real-time data handling. This detailed guide explores the inner workings of Kafka, focusing on Kafka producers, consumers, and brokers. Key insights include the path data takes from producer to broker, essential configurations, partitioning strategies, batching techniques, and performance metrics to monitor. The aim is to equip developers with the knowledge needed to debug and optimize their Kafka applications.
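The partitioning and batching steps on the producer side can be sketched as follows. Kafka's default partitioner hashes the record key with murmur2; this simplified sketch substitutes a stable MD5 hash, but preserves the key property: records with the same key always land on the same partition, which preserves per-key ordering.

```python
# Simplified sketch of key-based partitioning and producer-side batching.
# Kafka's default partitioner uses murmur2; MD5 stands in here purely for
# illustration. Same key -> same partition -> per-key ordering preserved.
import hashlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

def batch_by_partition(records, num_partitions):
    """Group (key, value) records by target partition, as a producer
    accumulates per-partition batches before sending to the broker."""
    batches = {}
    for key, value in records:
        p = choose_partition(key, num_partitions)
        batches.setdefault(p, []).append((key, value))
    return batches
```

In a real producer, batch size and linger time then trade latency against throughput when deciding when to flush each per-partition batch.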

  9. Article · Hacker News · 2y

    feldera/feldera: The Feldera Incremental Computation Engine

    Feldera is a high-performance incremental computation engine capable of incrementally evaluating arbitrary SQL programs. It efficiently processes inserts, updates, and deletes without recomputing older data and supports both live and historical data queries. The engine offers fast out-of-the-box performance, handles large datasets, guarantees consistency, and connects to various data sources. It's suitable for complex analytical tasks and feature engineering pipelines.
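The core trick, evaluating changes rather than recomputing, can be illustrated with a tiny incremental aggregate. This is a hand-rolled sketch of incremental view maintenance in general, not Feldera's actual engine or API.

```python
# Sketch of incremental view maintenance: instead of recomputing an
# aggregate from scratch, apply only the delta from each insert/delete.
# This illustrates the general technique, not Feldera's implementation.
class IncrementalSum:
    """Maintains the result of SELECT key, SUM(value) GROUP BY key
    under a stream of weighted changes."""

    def __init__(self):
        self.totals = {}

    def apply(self, key, value, weight):
        # weight = +1 for an insert, -1 for a delete of a previous row.
        self.totals[key] = self.totals.get(key, 0) + weight * value
        if self.totals[key] == 0:
            del self.totals[key]  # drop fully-retracted groups

view = IncrementalSum()
view.apply("eu", 100, +1)
view.apply("eu", 50, +1)
view.apply("eu", 100, -1)   # retract the first row
```

Each change costs O(1) here regardless of how much history has been ingested, which is the property that makes the approach scale to large datasets.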

  10. Article · Towards AI · 2y

    Journey From Data Warehouse To Lake To Lakehouse

The post uses a fictional story to make data storage concepts such as the Data Warehouse, Data Lake, and Data Lakehouse easier to grasp. It highlights the evolution from the structured data storage of Data Warehouses, to the flexible, low-cost storage of Data Lakes, and finally to the comprehensive and efficient storage solutions of Data Lakehouses, which combine the benefits of both previous systems. Key concepts like schema-on-read and schema-on-write are explained, and top providers for each storage solution are recommended.
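The schema-on-write vs schema-on-read distinction can be sketched directly; the schema and records below are invented for the example.

```python
# Sketch of schema-on-write vs schema-on-read. Schema-on-write
# (warehouse style) validates at ingest and rejects bad records;
# schema-on-read (lake style) stores anything and applies the schema
# only when a query reads the data. Schema and records are illustrative.
SCHEMA = {"id": int, "name": str}

def write_with_schema(store, record):
    """Schema-on-write: fail fast at ingest time."""
    for field, ftype in SCHEMA.items():
        if not isinstance(record.get(field), ftype):
            raise ValueError(f"bad field {field!r}")
    store.append(record)

def read_with_schema(store):
    """Schema-on-read: skip records that don't fit, at query time."""
    return [r for r in store
            if all(isinstance(r.get(f), t) for f, t in SCHEMA.items())]

warehouse, lake = [], []
write_with_schema(warehouse, {"id": 1, "name": "a"})
lake.extend([{"id": 1, "name": "a"}, {"id": "oops"}])  # lake accepts anything
valid = read_with_schema(lake)
```

The trade-off follows directly: schema-on-write pays validation cost upfront for clean reads, while schema-on-read defers the cost (and the error handling) to every query.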

  11. Article · Data Engineer Things · 2y

    I spent 7 hours diving deep into Apache Iceberg

This post delves into the internals of the Apache Iceberg table format, covering its data and metadata layers, manifest files, and how it manages read and write operations. It includes details on compaction, hidden partitioning, sorting, and row-level updates with both copy-on-write and merge-on-read modes. The goal is to offer a comprehensive understanding of Iceberg's capabilities and optimizations for managing large datasets efficiently.
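The copy-on-write vs merge-on-read contrast can be sketched with in-memory lists standing in for data files; this is a toy model of the idea, not Iceberg's on-disk layout.

```python
# Toy contrast of copy-on-write vs merge-on-read for row-level updates.
# Copy-on-write rewrites the data file with the change applied;
# merge-on-read keeps the base file immutable, records deletes
# separately, and merges at read time. Lists stand in for data files.
def copy_on_write(data_file, row_id, new_row):
    """Rewrite the whole file with one row replaced."""
    return [new_row if r["id"] == row_id else r for r in data_file]

def merge_on_read(data_file, delete_ids, new_rows):
    """Apply delete-file-style tombstones and appended rows at read time."""
    kept = [r for r in data_file if r["id"] not in delete_ids]
    return kept + new_rows

base = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]

# Update row 2 both ways; the visible result is identical.
cow = copy_on_write(base, 2, {"id": 2, "v": "B"})
mor = merge_on_read(base, delete_ids={2}, new_rows=[{"id": 2, "v": "B"}])
```

This is why copy-on-write favors read-heavy workloads (reads see plain files) while merge-on-read favors write-heavy ones (writes avoid rewriting large files, at the cost of merge work on every read until compaction).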