Best of Data Engineering: October 2024

  1. Article
    Data Engineer Things · 1y

    PoC Data Platform project utilizing modern data stack (Airflow, Spark, DBT, Trino, Hive metastore, Lightdash, Delta Lake)

    The PoC Data Platform demonstrates extracting, loading, and transforming data using modern data technologies like Airflow, Spark, DBT, Trino, Hive Metastore, Lightdash, and Delta Lake. It utilizes AdventureWorks data within a data lake environment and offers insights into configuring these tools for data engineering and system design. The platform provides a comprehensive Docker setup with detailed instructions, making it a valuable resource for both beginners and professionals in data systems.

  2. Article
    SwirlAI · 1y

    Memory in Agent Systems

    The post explores the implementation and importance of memory in generative AI agent systems. It covers different memory types, including short-term and long-term memory, and their roles. Short-term memory provides context during interactions, while long-term memory, split into episodic, semantic, and procedural types, ensures continuity and relevance of information. The author emphasizes the necessity of efficient memory management in agentic architectures.
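The short-term / long-term split the post describes can be sketched in a few lines of Python; the class and field names here are illustrative, not from the article:

```python
from collections import deque

class AgentMemory:
    """Toy sketch of the memory split described in the post."""

    def __init__(self, short_term_size: int = 5):
        # Short-term memory: a rolling window of recent turns,
        # i.e. the context fed into the next interaction.
        self.short_term = deque(maxlen=short_term_size)
        # Long-term memory, split as in the post:
        self.episodic = []    # past interactions and events
        self.semantic = {}    # facts the agent has learned
        self.procedural = {}  # how-to knowledge, e.g. named skills

    def observe(self, turn: str) -> None:
        self.short_term.append(turn)
        self.episodic.append(turn)  # full history kept long-term

    def remember_fact(self, key: str, value: str) -> None:
        self.semantic[key] = value

    def context(self) -> str:
        # What would be stuffed into the next prompt.
        return "\n".join(self.short_term)

mem = AgentMemory(short_term_size=2)
mem.observe("user: hi")
mem.observe("agent: hello")
mem.observe("user: what's the weather?")
print(mem.context())      # only the 2 most recent turns
print(len(mem.episodic))  # 3: everything survives in episodic memory
```

The point of the split is visible at the end: the prompt context is bounded while episodic memory keeps growing, which is why the post stresses efficient memory management.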

  3. Article
    Data Engineer Things · 1y

    Data Pipeline Development with MinIO, Iceberg, Nessie, Polars, StarRocks, Mage, and Docker

    Explore how to build an efficient data pipeline without using Spark by leveraging technologies like MinIO, Iceberg, Nessie, Polars, StarRocks, Mage, and Docker. The pipeline uses the medallion architecture with Bronze, Silver, and Gold layers to ensure data quality and integrity through the Write-Audit-Publish (WAP) pattern. The post provides a detailed guide to setting up the necessary components, executing data transformation and quality checks, and using branching strategies with Project Nessie to manage data versions. Integration with Slack for alert notifications and catalog setup for querying data using StarRocks are also discussed.
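The Write-Audit-Publish flow described above can be sketched independently of Iceberg and Nessie; in this minimal Python sketch a dict of branches stands in for the catalog, and the branch name and quality check are made up:

```python
# Minimal Write-Audit-Publish sketch. In the post this is done with
# Iceberg tables and Nessie branches; here a dict of branches stands
# in for the catalog.

catalog = {"main": []}  # branch name -> rows

def write(branch: str, rows: list) -> None:
    # WRITE: stage new data on an audit branch, not on main.
    catalog[branch] = catalog["main"] + rows

def audit(branch: str) -> bool:
    # AUDIT: run quality checks against the staged data only.
    return all(r.get("amount", -1) >= 0 for r in catalog[branch])

def publish(branch: str) -> None:
    # PUBLISH: fast-forward main to the audited branch, then drop it.
    catalog["main"] = catalog[branch]
    del catalog[branch]

write("etl_audit", [{"id": 1, "amount": 10}, {"id": 2, "amount": 5}])
if audit("etl_audit"):
    publish("etl_audit")

print(len(catalog["main"]))  # 2 rows now visible to consumers
```

Consumers only ever read `main`, so bad data that fails the audit never becomes visible, which is the whole appeal of the pattern.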

  4. Article
    Data Engineer Things · 1y

    The Ultimate Guide to CI/CD for Data Engineering in Databricks

    Implementing Continuous Integration and Continuous Deployment (CI/CD) for data engineering in Databricks involves unique challenges, such as the interdependence of code, data, and compute resources. Solutions include using Databricks' Git integration, Asset Bundles, and other tools for automating builds, tests, and deployments. Setting up CI/CD requires managing environments, code, data assets, and complex system integrations. Proper testing and handling of data state management are crucial for effective CI/CD pipelines in data engineering.

  5. Article
    MotherDuck · 2y

    Performant dbt pipelines with MotherDuck

    This post recaps learnings from the dbt+MotherDuck workshop and delves into building performant data pipelines using DuckDB and MotherDuck. Key steps include utilizing the read_blob() function, leveraging pre_hooks and variables in DuckDB, implementing incremental models with read_csv(), and handling data de-duplication using unnest() and arg_max(). These techniques aim to optimize data workflows and enhance data transformation and analysis efficiency.
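The de-duplication step (keeping only the latest version of each row, which the workshop does with DuckDB's `arg_max()`) reduces to a group-by-max; this plain-Python sketch mirrors that semantics with made-up field names:

```python
# Keep only the newest record per id, mirroring what
# arg_max(row, updated_at) ... GROUP BY id does in DuckDB SQL.
rows = [
    {"id": 1, "updated_at": "2024-01-01", "value": "old"},
    {"id": 1, "updated_at": "2024-02-01", "value": "new"},
    {"id": 2, "updated_at": "2024-01-15", "value": "only"},
]

latest = {}
for row in rows:
    kept = latest.get(row["id"])
    # ISO date strings compare correctly as plain strings.
    if kept is None or row["updated_at"] > kept["updated_at"]:
        latest[row["id"]] = row

deduped = sorted(latest.values(), key=lambda r: r["id"])
print([r["value"] for r in deduped])  # ['new', 'only']
```

In the actual pipeline the same logic runs inside a dbt incremental model, so only new rows are compared on each run instead of rescanning the full table.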

  6. Article
    Data Engineer Things · 1y

    Rethinking Data Layers: When Medallion Architecture Isn’t Enough

    Medallion Architecture's three-layer model (bronze, silver, gold) often falls short for large-scale businesses: crucial datasets end up hidden inside layers, and irregular update cadences demand more nuanced layering. Key considerations include granular pipeline tracking, regulatory compliance, data science needs, optimized reporting, data quality checks, and schema validation. Adapt the layers to organizational requirements, using techniques such as raw data storage, schema validation, and data masking to improve data integrity and security.
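Two of the techniques listed (schema validation on ingest and masking of sensitive columns before the reporting layer) can be sketched in a few lines; the schema and masking rules below are illustrative, not from the article:

```python
# Sketch: schema validation between raw and validated layers, and
# column masking before data reaches the reporting layer.
SCHEMA = {"user_id": int, "email": str, "amount": float}
MASKED_COLUMNS = {"email"}

def validate(row: dict) -> None:
    """Reject rows that break the expected schema."""
    for col, typ in SCHEMA.items():
        if col not in row:
            raise ValueError(f"missing column: {col}")
        if not isinstance(row[col], typ):
            raise TypeError(f"{col} should be {typ.__name__}")

def mask(row: dict) -> dict:
    """Hide sensitive columns from downstream consumers."""
    return {k: ("***" if k in MASKED_COLUMNS else v)
            for k, v in row.items()}

raw = {"user_id": 7, "email": "a@example.com", "amount": 12.5}
validate(raw)        # raises if the incoming row breaks the schema
safe = mask(raw)     # what the reporting layer is allowed to see
print(safe["email"])  # ***
```

Placing each check at a named layer boundary is what makes the extra layers pay off: a failure points at exactly one pipeline stage.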

  7. Article
    Decube · 2y

    Understanding Data Products and Data Contracts: Building Trust in Modern Data Management

    Data products and data contracts transform raw data into reliable assets, helping organizations manage data quality and access control. Data products are curated, cleaned datasets designed to solve specific business problems. Data contracts are formal agreements that ensure data meets specified quality and update standards, fostering trust. Domain management organizes data by business function, enhancing order and security.
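A data contract becomes most useful when it is machine-checkable. This sketch pins one down as code; the product name, schema, and SLA threshold are illustrative, not from the article:

```python
from dataclasses import dataclass

@dataclass
class DataContract:
    """A producing team's promise: a schema and a freshness SLA."""
    product: str
    schema: dict              # column -> type name
    max_staleness_hours: int  # freshness guarantee

    def check(self, columns: dict, staleness_hours: int) -> list:
        """Return a list of violations (empty = contract honored)."""
        violations = []
        for col, typ in self.schema.items():
            if columns.get(col) != typ:
                violations.append(f"schema: {col} should be {typ}")
        if staleness_hours > self.max_staleness_hours:
            violations.append("freshness SLA breached")
        return violations

contract = DataContract(
    product="orders",
    schema={"order_id": "int", "total": "float"},
    max_staleness_hours=24,
)
print(contract.check({"order_id": "int", "total": "float"}, 2))   # []
print(contract.check({"order_id": "int"}, 48))                    # 2 violations
```

Running such checks in CI or on a schedule is what turns the "formal agreement" into the trust the article describes: consumers find out about breakage before their dashboards do.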

  8. Article
    SwirlAI · 1y

    Observability in LLMOps pipeline - Different Levels of Scale

    Aurimas, the author of the SwirlAI Newsletter, discusses the increasing complexity and scale requirements of observability in LLMOps pipelines. He outlines the GenAI Value Chain, the stages from pre-training to GenAI Systems Engineering, and the challenges faced in tracking and observing different levels of AI systems, including RAG systems, agents, and multi-agent networks. The evolving nature of these systems demands more sophisticated observability tools, capable of handling big data analytics and complex, non-deterministic processes.

  9. Article
    BigData Boutique blog · 1y

    Elasticsearch Performance and Cost Efficiency on Elastic Cloud and On-Prem

    Discover essential strategies to optimize Elasticsearch performance and cost efficiency for both Elastic Cloud and on-premises deployments. Key tactics include scaling up vs. scaling out, data tiering, continuous monitoring of critical metrics, efficient shard distribution, and advanced query optimization techniques. Participants in a recent webinar hosted by BigData Boutique and Elastic learned how to enhance their Elasticsearch setups for optimal performance and cost-effectiveness.

  10. Article
    Community Picks · 1y

    Data Engineering Blog

    Simon Späti, an experienced data engineer and technical writer, shares his deep knowledge of data engineering on his blog. Explore his insights, design patterns, and curated notes to deepen your understanding and expertise, and subscribe to his SELECT Insights newsletter to stay updated.

  11. Article
    Daily Dose of Data Science | Avi Chawla | Substack · 2y

    Semi, Anti, and Natural Joins in DuckDB SQL

    The post introduces three additional types of SQL joins available in DuckDB: Semi Join, Anti Join, and Natural Join. It highlights the distinctions and use cases for each, such as Semi Join for checking record existence, Anti Join for finding non-matches, and Natural Join for a concise query without explicit join conditions. These joins offer more flexibility and elegance in SQL query writing.
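The semantics of the first two joins are easy to pin down outside SQL; this plain-Python sketch mirrors what SEMI and ANTI joins return in DuckDB (the tables are made up for illustration):

```python
orders = [{"order_id": 1, "customer": "a"},
          {"order_id": 2, "customer": "b"},
          {"order_id": 3, "customer": "c"}]
shipments = [{"order_id": 1}, {"order_id": 3}]

shipped_ids = {s["order_id"] for s in shipments}

# SEMI join: orders that HAVE a match in shipments. Only the left
# table's columns come back, and no row is duplicated even if an
# order had several shipments.
semi = [o for o in orders if o["order_id"] in shipped_ids]

# ANTI join: orders with NO match in shipments, i.e. the non-matches
# the post highlights as the Anti Join's use case.
anti = [o for o in orders if o["order_id"] not in shipped_ids]

print([o["order_id"] for o in semi])  # [1, 3]
print([o["order_id"] for o in anti])  # [2]
```

Compared with `WHERE order_id IN (...)` / `NOT IN (...)`, the join forms keep the existence check in the join clause itself, which is the conciseness the post is pointing at.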

  12. Article
    Medium · 2y

    Graph RAG, Automated Prompt Engineering, Agent Frameworks, and Other September Must-Reads

    September brought a wave of exciting topics in ML and AI, showcasing a diversity of tutorials and guides. Highlights include guides on implementing Graph RAG, mastering key Python functions for data scientists, automated prompt engineering, and building AI agents using Python. Additionally, articles covered SQL essentials for data engineers, insights on choosing LLM agent frameworks, and innovative data visualization techniques.