Best of Data Engineering, December 2024

  1. Article
    Dev Genius · 1y

    Building a Python-Based Data Lake

    Data lakes are vital for modern data ecosystems, allowing organizations to store and analyze large volumes of varied data without requiring a predefined schema. This guide details setting up a Python-based data lake using MinIO, PyIceberg, PyArrow, and Postgres, a stack whose simplicity makes it well suited to small and medium-sized setups. The step-by-step instructions cover installing the libraries, configuring SQL catalogs, transforming data with Pandas and PyArrow, and querying the results. Advanced operations using DuckDB are also explored, showcasing robust data handling with flexibility and scalability.

  2. Article
    Machine Learning News · 1y

    Meet MegaParse: An Open-Source AI Tool for Parsing Various Types of Documents for LLM Ingestion

    MegaParse is an open-source tool designed to efficiently parse various types of documents (PDF, Word, Excel, CSV, etc.) for ingestion into large language models (LLMs). It saves users significant time and effort by automating the conversion process while retaining information integrity. The tool is highly versatile, handling different document elements such as tables and images, and supports customizable output formats. Installation is straightforward via pip, with additional setups for dependencies like Poppler, Tesseract, and libmagic. MegaParse also provides advanced usage options and benchmarking capabilities, making it a reliable choice for developers and enterprises looking to streamline their AI data pipeline.
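The article's install path might look like the following on a Debian-based system; the system package names for the Poppler, Tesseract, and libmagic dependencies are assumptions and vary by platform, so check MegaParse's own README for your OS.

```shell
# Python package (the article notes installation is via pip)
pip install megaparse

# System dependencies mentioned in the article; names assumed for
# Debian/Ubuntu -- macOS users would use brew equivalents instead.
sudo apt-get install -y poppler-utils tesseract-ocr libmagic1
```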

  3. Article
    Data Engineer Things · 1y

    ETL and ELT

    The author reflects on their journey from chasing the latest data engineering tools to focusing on foundational concepts, emphasizing the shift from ETL (Extract, Transform, Load) to ELT (Extract, Load, Transform). The traditional ETL process, necessitated by the high costs and limitations of early data warehouses, is contrasted with the modern ELT approach, facilitated by advancements in cloud data warehousing. ELT offers greater flexibility and efficiency by loading raw data into the warehouse and handling transformations within the warehouse, aligning better with agile development practices.
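The ELT idea, load raw data first, transform inside the warehouse, can be sketched with SQLite standing in for a cloud warehouse; the table and column names are invented for illustration.

```python
import sqlite3

# ELT sketch: raw records are loaded untouched, and casting/aggregation
# happen later, in-warehouse, with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount TEXT, country TEXT)")

# Extract + Load: raw strings go in as-is, no upfront transformation.
rows = [(1, "10.50", "us"), (2, "4.25", "us"), (3, "8.00", "de")]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)

# Transform: done inside the warehouse, on demand -- the flexibility
# the article credits to modern cloud warehousing.
conn.execute("""
    CREATE TABLE orders_by_country AS
    SELECT UPPER(country) AS country, SUM(CAST(amount AS REAL)) AS total
    FROM raw_orders GROUP BY country
""")
result = conn.execute(
    "SELECT * FROM orders_by_country ORDER BY country").fetchall()
print(result)  # [('DE', 8.0), ('US', 14.75)]
```

Keeping the raw table around means new transformations can be added later without re-extracting from the source, which is what aligns ELT with agile practice.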

  4. Article
    The New Stack · 1y

    5 Python Libraries Every Data Engineer Should Know

    Python is a powerful language for data engineering, enhanced by essential third-party libraries. For beginners, Beautiful Soup 4 and Requests are ideal for web scraping and sending HTTP requests. Intermediate users may benefit from Apache Airflow for workflow automation and Boto3 for integrating AWS services. Advanced users can leverage Pandas for comprehensive data manipulation and analysis.
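A minimal Beautiful Soup 4 sketch of the scraping use case the article mentions; in practice the HTML would come from `requests.get(url).text`, but a static snippet keeps this self-contained and offline. The markup is invented.

```python
from bs4 import BeautifulSoup

# Stand-in for a fetched page.
html = """
<ul id="articles">
  <li><a href="/etl">ETL and ELT</a></li>
  <li><a href="/lake">Building a Data Lake</a></li>
</ul>
"""

# Parse with the stdlib parser and pull out link text via a CSS selector.
soup = BeautifulSoup(html, "html.parser")
titles = [a.get_text() for a in soup.select("#articles a")]
print(titles)  # ['ETL and ELT', 'Building a Data Lake']
```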

  5. Article
    Netflix TechBlog · 1y

    Cloud Efficiency at Netflix

    Netflix uses AWS for its cloud infrastructure needs, leveraging a mix of open-source and proprietary solutions to run its platform efficiently. The Data & Insights organization collaborates with engineering teams to share key efficiency metrics, enabling informed business decisions. Netflix's Platform DSE team provides critical insights through the Foundational Platform Data (FPD) and Cloud Efficiency Analytics (CEA) components to help teams understand resource usage and cost. The organization aims for nearly complete cost insight coverage and plans to extend these solutions to other business areas, incorporating predictive analytics and machine learning for optimization.

  6. Article
    Metadata · 1y

    Stream Processing

    Batch processes can delay business operations, so stream processing is used to handle events immediately as they occur. Stream processing involves systems notifying consumers of new events, often through message brokers like RabbitMQ or log-based brokers like Kafka. Because dual writes can introduce errors and inconsistencies, Change Data Capture (CDC) is used instead to replicate data consistently across systems. Event sourcing records all changes immutably, aiding auditability, recovery, and analytics. Stream processing appears in applications such as fraud detection, trading systems, and manufacturing, and relies on techniques like microbatching and checkpointing for fault tolerance.
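The event-sourcing idea can be shown in a few lines of plain Python: state is never mutated directly, only derived by replaying an append-only log of immutable events. Names and the banking example are invented for illustration.

```python
from dataclasses import dataclass

# Every change is an immutable event in an append-only log.
@dataclass(frozen=True)
class Event:
    account: str
    delta: int

log: list[Event] = []

def deposit(account: str, amount: int) -> None:
    log.append(Event(account, amount))

def withdraw(account: str, amount: int) -> None:
    log.append(Event(account, -amount))

def balances() -> dict[str, int]:
    # Replaying the log reproduces state at any point in time -- the
    # property that makes event sourcing useful for audit and recovery.
    state: dict[str, int] = {}
    for e in log:
        state[e.account] = state.get(e.account, 0) + e.delta
    return state

deposit("alice", 100)
withdraw("alice", 30)
deposit("bob", 50)
print(balances())  # {'alice': 70, 'bob': 50}
```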

  7. Article
    Martin Fowler · 1y

    Designing data products

    The post discusses a methodical approach to designing data products by working backwards from use cases. It outlines the characteristics of effective data products and differentiates them from data-driven applications. The approach focuses on avoiding overdesign and ensuring data products are discoverable, addressable, understandable, trustworthy, natively accessible, interoperable, valuable on their own, and secure. A real-world example in fashion retail is provided to illustrate the process.

  8. Article
    Data Engineer Things · 1y

    Building Machine Learning Pipelines with the FTI Architecture: A Practical Step-by-Step Guide

    FTI (Feature, Training, Inference) architecture offers a modular and scalable framework for building machine learning pipelines. It divides the workflow into three independent stages: Feature Pipeline, Training Pipeline, and Inference Pipeline. This approach ensures modularity, reusability, consistency, scalability, and reproducibility. The Feature Pipeline transforms raw data into engineered features, the Training Pipeline manages the model's lifecycle, and the Inference Pipeline serves real-time or batch predictions using the trained model.
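The three-stage split can be sketched as three independent functions wired only by their inputs and outputs; the "model" here is deliberately trivial (a mean label), and all names are illustrative, since a real pipeline would swap in a feature store and an actual training framework.

```python
def feature_pipeline(raw_rows):
    # Feature stage: turn raw records into engineered features.
    return [{"ctr": r["clicks"] / max(r["views"], 1)} for r in raw_rows]

def training_pipeline(features, labels):
    # Training stage: fit a trivial model and return the artifact.
    return {"mean_label": sum(labels) / len(labels)}

def inference_pipeline(model, features):
    # Inference stage: serve predictions from the trained artifact,
    # batch or real-time, without touching the other stages.
    return [model["mean_label"] for _ in features]

raw = [{"clicks": 3, "views": 10}, {"clicks": 1, "views": 4}]
feats = feature_pipeline(raw)
model = training_pipeline(feats, labels=[1.0, 0.0])
preds = inference_pipeline(model, feats)
print(preds)  # [0.5, 0.5]
```

Because each stage consumes only the previous stage's output, any one of them can be re-run, scaled, or replaced independently, which is the modularity the FTI approach is after.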

  9. Article
    Databricks · 1y

    Simplify Data Ingestion With the New Python Data Source API

    Spark's new Python Data Source API addresses the challenges data engineers face when integrating diverse data sources, particularly in IoT applications. By exposing abstract classes and familiar object-oriented patterns, the API simplifies ingesting data from REST APIs and other custom sources. The Shell example demonstrates how the API enables a modular, reusable approach, enhancing productivity and promoting collaboration. The API supports both batch and streaming contexts, enabling efficient data integration across various use cases.
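The real abstract classes live in `pyspark.sql.datasource` (`DataSource`, `DataSourceReader`); to keep this runnable without a Spark installation, the sketch below uses stdlib `abc` classes with invented names that only mirror the pattern, a reader subclass that yields rows, which the engine then consumes.

```python
from abc import ABC, abstractmethod

# Illustration only: not the pyspark classes, just the same shape --
# subclass a reader, implement read(), yield rows as tuples.
class Reader(ABC):
    @abstractmethod
    def read(self):
        """Yield rows as tuples."""

class FakeRestReader(Reader):
    # A real reader would page through a REST API (e.g. with requests)
    # inside read(); here the records are supplied directly.
    def __init__(self, records):
        self.records = records

    def read(self):
        for r in self.records:
            yield (r["device"], r["temp"])

reader = FakeRestReader([{"device": "a1", "temp": 21.5},
                         {"device": "b2", "temp": 19.0}])
rows = list(reader.read())
print(rows)  # [('a1', 21.5), ('b2', 19.0)]
```

Packaging the source-specific logic behind one reader class is what makes the approach reusable across teams, as the Shell example in the article illustrates.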

  10. Article
    Tinybird · 1y

    Building an Insights page for your SaaS: from idea to production

    A Data Engineer shares steps for embedding an Insights page into a SaaS application, using Tinybird for analytics. The guide covers understanding user needs, creating data sources and APIs, prototyping and testing, optimizing for scale, and monitoring the project. It emphasizes starting simple, optimizing before production, and continuous monitoring.

  11. Article
    Data Engineer Things · 1y

    Predictions for Data Engineering in 2025 Based on 200+ Hours of Content Review

    To succeed in data engineering by 2025, engineers need to leverage low-cost storage solutions like object stores, enhance critical thinking skills to complement AI tools, and understand the business context to drive real value. Pairing technical abilities with business acumen will be the key to creating impactful solutions.