Best of Data Engineering: October 2024

  1. Article
    Data Engineer Things · 1y

    PoC Data Platform project utilizing modern data stack (Airflow, Spark, DBT, Trino, Hive metastore, Lightdash, Delta Lake)

    The PoC Data Platform demonstrates extracting, loading, and transforming data using modern data technologies like Airflow, Spark, DBT, Trino, Hive Metastore, Lightdash, and Delta Lake. It utilizes AdventureWorks data within a data lake environment and offers insights into configuring these tools for data engineering and system design. The platform provides a comprehensive Docker setup with detailed instructions, making it a valuable resource for both beginners and professionals in data systems.

  2. Article
    SwirlAI · 1y

    Memory in Agent Systems

    The post explores the implementation and importance of memory in generative AI agent systems. It covers different memory types, including short-term and long-term memory, and their roles. Short-term memory provides context during interactions, while long-term memory, split into episodic, semantic, and procedural types, ensures continuity and relevance of information. The author emphasizes the necessity of efficient memory management in agentic architectures.
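The short-term / long-term split the post describes can be sketched in a few lines of Python; the class and field names here are illustrative, not from the article:

```python
from collections import deque

class AgentMemory:
    """Toy sketch of the memory split described in the post."""

    def __init__(self, short_term_size: int = 5):
        # Short-term memory: a rolling window of recent turns,
        # i.e. the context fed into the next interaction.
        self.short_term = deque(maxlen=short_term_size)
        # Long-term memory, split as in the post:
        self.episodic = []    # past interactions and events
        self.semantic = {}    # facts the agent has learned
        self.procedural = {}  # how-to knowledge, e.g. named skills

    def observe(self, turn: str) -> None:
        self.short_term.append(turn)
        self.episodic.append(turn)  # full history kept long-term

    def remember_fact(self, key: str, value: str) -> None:
        self.semantic[key] = value

    def context(self) -> str:
        # What would be stuffed into the next prompt.
        return "\n".join(self.short_term)

mem = AgentMemory(short_term_size=2)
mem.observe("user: hi")
mem.observe("agent: hello")
mem.observe("user: what's the weather?")
print(mem.context())      # only the 2 most recent turns
print(len(mem.episodic))  # 3: everything survives in episodic memory
```

The point of the split is visible at the end: the prompt context is bounded while episodic memory keeps growing, which is why the post stresses efficient memory management.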

  3. Article
    Data Engineer Things · 1y

    Data Pipeline Development with MinIO, Iceberg, Nessie, Polars, StarRocks, Mage, and Docker

    Explore how to build an efficient data pipeline without using Spark by leveraging technologies like MinIO, Iceberg, Nessie, Polars, StarRocks, Mage, and Docker. The pipeline uses the medallion architecture with Bronze, Silver, and Gold layers to ensure data quality and integrity through the Write-Audit-Publish (WAP) pattern. The post provides a detailed guide to setting up the necessary components, executing data transformation and quality checks, and using branching strategies with Project Nessie to manage data versions. Integration with Slack for alert notifications and catalog setup for querying data using StarRocks are also discussed.
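The Write-Audit-Publish flow described above can be sketched independently of Iceberg and Nessie; in this minimal Python sketch a dict of branches stands in for the catalog, and the branch name and quality check are made up:

```python
# Minimal Write-Audit-Publish sketch. In the post this is done with
# Iceberg tables and Nessie branches; here a dict of branches stands
# in for the catalog.

catalog = {"main": []}  # branch name -> rows

def write(branch: str, rows: list) -> None:
    # WRITE: stage new data on an audit branch, not on main.
    catalog[branch] = catalog["main"] + rows

def audit(branch: str) -> bool:
    # AUDIT: run quality checks against the staged data only.
    return all(r.get("amount", -1) >= 0 for r in catalog[branch])

def publish(branch: str) -> None:
    # PUBLISH: fast-forward main to the audited branch, then drop it.
    catalog["main"] = catalog[branch]
    del catalog[branch]

write("etl_audit", [{"id": 1, "amount": 10}, {"id": 2, "amount": 5}])
if audit("etl_audit"):
    publish("etl_audit")

print(len(catalog["main"]))  # 2 rows now visible to consumers
```

Consumers only ever read `main`, so bad data that fails the audit never becomes visible, which is the whole appeal of the pattern.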

  4. Article
    Data Engineer Things · 1y

    The Ultimate Guide to CI/CD for Data Engineering in Databricks

    Implementing Continuous Integration and Continuous Deployment (CI/CD) for data engineering in Databricks involves unique challenges, such as the interdependence of code, data, and compute resources. Solutions include using Databricks' Git integration, Asset Bundles, and other tools for automating builds, tests, and deployments. Setting up CI/CD requires managing environments, code, data assets, and complex system integrations. Proper testing and handling of data state management are crucial for effective CI/CD pipelines in data engineering.

  5. Article
    MotherDuck · 2y

    Performant dbt pipelines with MotherDuck

    This post recaps learnings from the dbt+MotherDuck workshop and delves into building performant data pipelines using DuckDB and MotherDuck. Key steps include utilizing the read_blob() function, leveraging pre_hooks and variables in DuckDB, implementing incremental models with read_csv(), and handling data de-duplication using unnest() and arg_max(). These techniques aim to optimize data workflows and enhance data transformation and analysis efficiency.
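The de-duplication step (keeping only the latest version of each row, which the workshop does with DuckDB's `arg_max()`) reduces to a group-by-max; this plain-Python sketch mirrors that semantics with made-up field names:

```python
# Keep only the newest record per id, mirroring what
# arg_max(row, updated_at) ... GROUP BY id does in DuckDB SQL.
rows = [
    {"id": 1, "updated_at": "2024-01-01", "value": "old"},
    {"id": 1, "updated_at": "2024-02-01", "value": "new"},
    {"id": 2, "updated_at": "2024-01-15", "value": "only"},
]

latest = {}
for row in rows:
    kept = latest.get(row["id"])
    # ISO date strings compare correctly as plain strings.
    if kept is None or row["updated_at"] > kept["updated_at"]:
        latest[row["id"]] = row

deduped = sorted(latest.values(), key=lambda r: r["id"])
print([r["value"] for r in deduped])  # ['new', 'only']
```

In the actual pipeline the same logic runs inside a dbt incremental model, so only new rows are compared on each run instead of rescanning the full table.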

  6. Article
    Data Engineer Things · 1y

    Rethinking Data Layers: When Medallion Architecture Isn’t Enough

    Medallion Architecture's three-layer model (bronze, silver, gold) often falls short for large-scale businesses: crucial datasets end up hidden inside layers, and irregular update cadences demand more nuanced layering. Key considerations include granular pipeline tracking, regulatory compliance, data science needs, optimized reporting, data quality checks, and schema validation. Adapt the layers to organizational requirements, using techniques such as raw data storage, schema validation, and data masking to improve data integrity and security.
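Two of the techniques listed (schema validation on ingest and masking of sensitive columns before the reporting layer) can be sketched in a few lines; the schema and masking rules below are illustrative, not from the article:

```python
# Sketch: schema validation between raw and validated layers, and
# column masking before data reaches the reporting layer.
SCHEMA = {"user_id": int, "email": str, "amount": float}
MASKED_COLUMNS = {"email"}

def validate(row: dict) -> None:
    """Reject rows that break the expected schema."""
    for col, typ in SCHEMA.items():
        if col not in row:
            raise ValueError(f"missing column: {col}")
        if not isinstance(row[col], typ):
            raise TypeError(f"{col} should be {typ.__name__}")

def mask(row: dict) -> dict:
    """Hide sensitive columns from downstream consumers."""
    return {k: ("***" if k in MASKED_COLUMNS else v)
            for k, v in row.items()}

raw = {"user_id": 7, "email": "a@example.com", "amount": 12.5}
validate(raw)        # raises if the incoming row breaks the schema
safe = mask(raw)     # what the reporting layer is allowed to see
print(safe["email"])  # ***
```

Placing each check at a named layer boundary is what makes the extra layers pay off: a failure points at exactly one pipeline stage.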

  7. Article
    Decube · 2y

    Understanding Data Products and Data Contracts: Building Trust in Modern Data Management

    Data products and data contracts transform raw data into reliable assets, helping organizations manage data quality and access control. Data products are curated, cleaned datasets designed to solve specific business problems. Data contracts are formal agreements that ensure data meets specified quality and update standards, fostering trust. Domain management organizes data by business function, enhancing order and security.
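A data contract becomes most useful when it is machine-checkable. This sketch pins one down as code; the product name, schema, and SLA threshold are illustrative, not from the article:

```python
from dataclasses import dataclass

@dataclass
class DataContract:
    """A producing team's promise: a schema and a freshness SLA."""
    product: str
    schema: dict              # column -> type name
    max_staleness_hours: int  # freshness guarantee

    def check(self, columns: dict, staleness_hours: int) -> list:
        """Return a list of violations (empty = contract honored)."""
        violations = []
        for col, typ in self.schema.items():
            if columns.get(col) != typ:
                violations.append(f"schema: {col} should be {typ}")
        if staleness_hours > self.max_staleness_hours:
            violations.append("freshness SLA breached")
        return violations

contract = DataContract(
    product="orders",
    schema={"order_id": "int", "total": "float"},
    max_staleness_hours=24,
)
print(contract.check({"order_id": "int", "total": "float"}, 2))   # []
print(contract.check({"order_id": "int"}, 48))                    # 2 violations
```

Running such checks in CI or on a schedule is what turns the "formal agreement" into the trust the article describes: consumers find out about breakage before their dashboards do.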

  8. Article
    SwirlAI · 1y

    Observability in LLMOps pipeline - Different Levels of Scale

    Aurimas, the author of the SwirlAI Newsletter, discusses the increasing complexity and scale requirements of observability in LLMOps pipelines. He outlines the GenAI Value Chain, the stages from pre-training to GenAI Systems Engineering, and the challenges faced in tracking and observing different levels of AI systems, including RAG systems, agents, and multi-agent networks. The evolving nature of these systems demands more sophisticated observability tools, capable of handling big data analytics and complex, non-deterministic processes.

  9. Article
    BigData Boutique blog · 1y

    Elasticsearch Performance and Cost Efficiency on Elastic Cloud and On-Prem

    Discover essential strategies to optimize Elasticsearch performance and cost efficiency for both Elastic Cloud and on-premises deployments. Key tactics include scaling up vs. scaling out, data tiering, continuous monitoring of critical metrics, efficient shard distribution, and advanced query optimization techniques. Participants in a recent webinar hosted by BigData Boutique and Elastic learned how to enhance their Elasticsearch setups for optimal performance and cost-effectiveness.

  10. Article
    Community Picks · 1y

    Data Engineering Blog

    Simon Späti, an experienced data engineer and technical writer, shares his deep knowledge of data engineering on his blog. Explore his insights, design patterns, and curated notes to deepen your understanding and expertise, and subscribe to his SELECT Insights newsletter to stay updated.

  11. Article
    Daily Dose of Data Science | Avi Chawla | Substack · 2y

    Semi, Anti, and Natural Joins in DuckDB SQL

    The post introduces three additional types of SQL joins available in DuckDB: Semi Join, Anti Join, and Natural Join. It highlights the distinctions and use cases for each, such as Semi Join for checking record existence, Anti Join for finding non-matches, and Natural Join for a concise query without explicit join conditions. These joins offer more flexibility and elegance in SQL query writing.
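The semantics of the first two joins are easy to pin down outside SQL; this plain-Python sketch mirrors what SEMI and ANTI joins return in DuckDB (the tables are made up for illustration):

```python
orders = [{"order_id": 1, "customer": "a"},
          {"order_id": 2, "customer": "b"},
          {"order_id": 3, "customer": "c"}]
shipments = [{"order_id": 1}, {"order_id": 3}]

shipped_ids = {s["order_id"] for s in shipments}

# SEMI join: orders that HAVE a match in shipments. Only the left
# table's columns come back, and no row is duplicated even if an
# order had several shipments.
semi = [o for o in orders if o["order_id"] in shipped_ids]

# ANTI join: orders with NO match in shipments, i.e. the non-matches
# the post highlights as the Anti Join's use case.
anti = [o for o in orders if o["order_id"] not in shipped_ids]

print([o["order_id"] for o in semi])  # [1, 3]
print([o["order_id"] for o in anti])  # [2]
```

Compared with `WHERE order_id IN (...)` / `NOT IN (...)`, the join forms keep the existence check in the join clause itself, which is the conciseness the post is pointing at.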

  12. Article
    Medium · 2y

    Graph RAG, Automated Prompt Engineering, Agent Frameworks, and Other September Must-Reads

    September brought a wave of exciting topics in ML and AI, showcasing a diversity of tutorials and guides. Highlights include guides on implementing Graph RAG, mastering key Python functions for data scientists, automated prompt engineering, and building AI agents using Python. Additionally, articles covered SQL essentials for data engineers, insights on choosing LLM agent frameworks, and innovative data visualization techniques.