Best of Data Engineering: January 2025

  1. Article
    Data Engineer Things · 1y

    End to End Data Engineering

    This post covers the tools, technologies, and concepts essential to data engineering, and describes different paths to success depending on role and background. It stresses that both the analytics side and the infrastructure side of the field matter, and mentions popular tools such as Airflow and Snowflake. It also discusses the role of software engineering principles and the specific roles within data engineering.

  2. Article
    Data Engineer Things · 1y

    Apache Airflow Overview

    Apache Airflow, created at Airbnb in 2014 and now an open-source project under Apache, is a popular orchestration tool for managing complex data workflows. It operates using Directed Acyclic Graphs (DAGs) to define tasks and their dependencies. Core components include the Scheduler, Web Server, Metadata Database, and Workers. Airflow supports task concurrency, resource management, and integrations with external systems via operators and hooks. It offers various executors for task management, including SequentialExecutor, LocalExecutor, CeleryExecutor, and KubernetesExecutor. Deployment options range from single-machine setups to distributed and Kubernetes-based environments.
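
The way a DAG of tasks and dependencies determines run order can be sketched in plain Python. This is only an illustration using the standard library's `graphlib`, not Airflow's API; the task names and the `execution_batches` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"extract"},
    "load": {"transform", "validate"},
}


def execution_batches(deps):
    """Group tasks into batches; tasks in one batch have no unmet dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose upstream tasks are done
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to unlock downstream tasks
    return batches


batches = execution_batches(dependencies)
print(batches)
```

Tasks in the same batch could run concurrently, which is roughly the decision a scheduler makes before handing work to an executor.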

  3. Article
    Quastor Daily · 1y

    How Pinterest Stores and Transfers Hundreds of Terabytes of Data Daily

    Pinterest uses a Change Data Capture (CDC) system to manage and transfer large volumes of data in real time. Because the system captures and transfers only the changes to the data, it keeps databases synchronized without repeatedly copying full tables, which improves performance. Pinterest's CDC architecture relies on open source tools such as Debezium and Apache Kafka for scalability and reliability. The post also offers developers practical tips on technology selection, memory allocation, and cognitive psychology techniques for improving coding skills.
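
The idea of transferring only changes can be sketched as a consumer that applies change events to a replica. This is a simplified illustration, not Pinterest's pipeline: the event shape is loosely modeled on the `op` / `before` / `after` fields of a Debezium change event, the `id` and `board` columns are made up, and no Kafka consumer is involved.

```python
def apply_change(replica, event):
    """Apply one simplified change event to an in-memory replica keyed by primary key."""
    op = event["op"]
    if op in ("c", "u", "r"):
        # create, update, or snapshot read: upsert the new row state
        row = event["after"]
        replica[row["id"]] = row
    elif op == "d":
        # delete: the old row state identifies which key to remove
        replica.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unknown operation: {op!r}")
    return replica


# Hypothetical stream of change events, in commit order.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "board": "recipes"}},
    {"op": "c", "before": None, "after": {"id": 2, "board": "travel"}},
    {"op": "u", "before": {"id": 1, "board": "recipes"}, "after": {"id": 1, "board": "baking"}},
    {"op": "d", "before": {"id": 2, "board": "travel"}, "after": None},
]

replica = {}
for event in events:
    apply_change(replica, event)

print(replica)
```

Four small events bring the replica up to date; no full table copy is needed.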

  4. Article
    Data Engineer Things · 1y

    Why I Love Python as Data Engineer

    Python is favored by data engineers for its versatility, simplicity, and rich library ecosystem. It handles both small and large-scale data tasks and makes data manipulation and automation easier. Despite limitations such as slower execution speed and higher memory consumption, its readable code and straightforward debugging make it a preferred choice for many. Python also integrates well with tools like Apache Spark and with data visualization libraries, which adds to its appeal.

  5. Article
    Tiger's Place · 1y

    Data Loading Patterns (data integration)

    This post discusses data loading patterns for data integration: full snapshot load, incremental load, delta load, and real-time updates. It explains the implementation techniques, key challenges, and use cases for each, and shows how they address different efficiency, history-tracking, and immediacy requirements.
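
Two of these patterns can be contrasted in a small sketch. It assumes each source row carries an `updated_at` timestamp that serves as a watermark; the column names and both helper functions are hypothetical and are not taken from the article.

```python
from datetime import datetime


def full_snapshot_load(source_rows):
    """Replace the target wholesale with a copy of every source row."""
    return {row["id"]: dict(row) for row in source_rows}


def incremental_load(source_rows, target, watermark):
    """Upsert only rows modified after the watermark; return (new_watermark, rows_loaded)."""
    new_watermark = watermark
    loaded = 0
    for row in source_rows:
        if row["updated_at"] > watermark:
            target[row["id"]] = dict(row)
            new_watermark = max(new_watermark, row["updated_at"])
            loaded += 1
    return new_watermark, loaded


source = [
    {"id": 1, "status": "new", "updated_at": datetime(2025, 1, 1)},
    {"id": 2, "status": "new", "updated_at": datetime(2025, 1, 2)},
]

# Initial load: copy everything and remember the latest modification time.
target = full_snapshot_load(source)
watermark = max(row["updated_at"] for row in source)

# Later: row 2 changes and row 3 arrives in the source.
source[1] = {"id": 2, "status": "shipped", "updated_at": datetime(2025, 1, 3)}
source.append({"id": 3, "status": "new", "updated_at": datetime(2025, 1, 4)})

watermark, loaded = incremental_load(source, target, watermark)
print(loaded, watermark)
```

One limitation of this watermark approach is that rows deleted from the source are never noticed, since a deleted row has no timestamp left to compare.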

  6. Article
    Data Engineer Things · 1y

    Netflix Movie Analytics (Homemade)

    A data engineer combines a passion for film with data analytics by analyzing their Netflix viewing habits. Using data exported from Netflix and enriched through The Movie Database (TMDB) API, they store and process the data on Google Cloud Platform (GCP). The data is modeled into a Star Schema on Google BigQuery, orchestrated with Airflow, and visualized using Tableau. Key insights include favorite genres, preferred viewing days, and overall streaming patterns.
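
As a rough illustration of star schema modeling, the sketch below splits flat viewing records into two dimension tables and one fact table, then aggregates over them. The table names, columns, and sample rows are all assumptions for demonstration; the author's actual BigQuery model is not shown in this summary.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

# Hypothetical flat records, as they might look after export and enrichment.
raw_views = [
    {"title": "Film A", "genre": "Drama", "watched_on": date(2024, 1, 6), "minutes": 120},
    {"title": "Film B", "genre": "Comedy", "watched_on": date(2024, 1, 7), "minutes": 95},
    {"title": "Film A", "genre": "Drama", "watched_on": date(2024, 1, 13), "minutes": 30},
]


def build_star_schema(views):
    """Split flat records into dim_title, dim_date, and a fact table of viewing events."""
    dim_title, dim_date, fact_viewing = {}, {}, []
    for view in views:
        title = dim_title.setdefault(
            view["title"],
            {"title_key": len(dim_title) + 1, "title": view["title"], "genre": view["genre"]},
        )
        day = view["watched_on"]
        date_row = dim_date.setdefault(
            day,
            {"date_key": int(day.strftime("%Y%m%d")), "weekday": WEEKDAYS[day.weekday()]},
        )
        # The fact row holds only foreign keys and the numeric measure.
        fact_viewing.append(
            {"title_key": title["title_key"], "date_key": date_row["date_key"], "minutes": view["minutes"]}
        )
    return list(dim_title.values()), list(dim_date.values()), fact_viewing


dim_title, dim_date, fact_viewing = build_star_schema(raw_views)

# Example query: minutes watched per genre, joining the fact table to dim_title.
genre_by_key = {row["title_key"]: row["genre"] for row in dim_title}
minutes_by_genre = defaultdict(int)
for fact in fact_viewing:
    minutes_by_genre[genre_by_key[fact["title_key"]]] += fact["minutes"]

print(dict(minutes_by_genre))
```

Questions such as preferred viewing days follow the same shape: join the fact table to `dim_date` and group by `weekday`.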

  7. Article
    SwirlAI · 1y

    Building AI Agents from scratch - Part 2: Reflection and Working Memory

    Learn about the Reflection pattern in AI agent systems, its relation to short-term memory, and how to implement an Agent class that utilizes Reflection to improve performance. This guide offers code examples, explains pros and cons, and showcases the connection between agent memory and Reflection capabilities. The practical example includes revising an action plan generated by an AI agent to fix hallucinations and improve response accuracy.
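
The generate, critique, and revise loop behind Reflection can be sketched with stand-in functions in place of real model calls. This is not the article's Agent class: `run_with_reflection`, `fake_generate`, `fake_critique`, and the tool names are hypothetical, and the list named `working_memory` plays the role of short-term memory.

```python
ALLOWED_TOOLS = {"search", "calculator"}  # hypothetical tools the agent really has


def fake_generate(memory):
    """Stand-in for a model call: the first draft invents a tool, later drafts do not."""
    has_feedback = any(item["role"] == "feedback" for item in memory)
    return ["search", "calculator"] if has_feedback else ["search", "teleport"]


def fake_critique(memory):
    """Stand-in for a reflection call: flag plan steps that use unknown tools."""
    latest_draft = [item for item in memory if item["role"] == "draft"][-1]["content"]
    unknown = [step for step in latest_draft if step not in ALLOWED_TOOLS]
    return f"Unknown tools: {unknown}" if unknown else None


def run_with_reflection(task, generate, critique, max_rounds=3):
    """Draft a plan, then revise it until the critique passes or rounds run out."""
    working_memory = [{"role": "task", "content": task}]
    draft = generate(working_memory)
    working_memory.append({"role": "draft", "content": draft})
    for _ in range(max_rounds):
        feedback = critique(working_memory)
        if feedback is None:  # nothing left to fix
            break
        # Store the feedback so the next draft can take it into account.
        working_memory.append({"role": "feedback", "content": feedback})
        draft = generate(working_memory)
        working_memory.append({"role": "draft", "content": draft})
    return draft, working_memory


plan, memory = run_with_reflection("Plan a cost estimate", fake_generate, fake_critique)
print(plan)
print([item["role"] for item in memory])
```

The revised draft only improves because earlier drafts and feedback are kept in memory and passed back to the generator, which is the link between Reflection and short-term memory that the summary describes.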