Best of Data Engineering 2024

  1. Article · Medium · 2y

    How Did LinkedIn Handle 7 Trillion Messages Daily With Apache Kafka?

    LinkedIn uses Apache Kafka to manage and process up to 7 trillion messages daily. They achieve reliability and scalability through a multi-tiered Kafka deployment across multiple data centers, leveraging local and aggregate clusters. LinkedIn ensures message completeness with an internal auditing tool that tracks sent and consumed messages. They maintain a close relationship with the open-source Kafka community by regularly integrating features and patches from their internal branches into the upstream Kafka branch.

  2. Article · Medium · 2y

    Roadmap to Learn Data Engineering: How I Would Start Again

    A roadmap for learning data engineering, covering Python, SQL, the command line, data warehousing, data modeling, data storage, data processing, data transformation, data orchestration, advanced topics, and how to stay up to date.

  3. Article · Tinybird · 2y

    How to choose the right type of database

    Understanding the different types of databases, factors to consider when choosing a database, and the implications of the CAP theorem on database selection.

  4. Article · KDnuggets · 2y

    5 Tips for Improving SQL Query Performance

    Strong SQL skills are crucial in data roles, where optimizing query performance can significantly impact application efficiency. Key tips include avoiding SELECT * by specifying columns, using GROUP BY instead of SELECT DISTINCT, limiting query results, and employing indexes with caution. Balancing these techniques can improve query performance and ensure efficient database operations.
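
    The tips above can be sketched with Python's built-in sqlite3 module; the orders table and its columns are made up for illustration:

    ```python
    import sqlite3

    # In-memory database with a small, hypothetical orders table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders (customer, amount) VALUES (?, ?)",
        [("alice", 10.0), ("bob", 25.0), ("alice", 5.0), ("carol", 40.0)],
    )

    # Tip: select only the columns you need instead of SELECT *.
    rows = conn.execute("SELECT customer, amount FROM orders").fetchall()

    # Tip: prefer GROUP BY over SELECT DISTINCT when you aggregate anyway.
    totals = dict(conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
    ).fetchall())

    # Tip: LIMIT bounds the result set for exploratory queries.
    top = conn.execute(
        "SELECT customer, amount FROM orders ORDER BY amount DESC LIMIT 1"
    ).fetchone()

    # Tip: index columns that are filtered or joined on frequently;
    # indexes speed reads but slow writes, hence "with caution".
    conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
    ```

    The same trade-offs apply in any relational engine; only the index internals differ.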

  5. Article · Community Picks · 2y

    The State of Data Engineering 2024

    The 2024 State of Data Engineering report discusses the influence of GenAI on software infrastructure, the expansion of product offerings due to the economic downturn, and the impact of open table formats and their catalogs in the data lake industry. It also highlights the importance of data version control and observability in AI/ML systems.

  6. Article · Dev Genius · 1y

    Building a Python-Based Data Lake

    Data lakes are vital for modern data ecosystems, allowing organizations to store and analyze large volumes of varied data without requiring a predefined schema. This guide details setting up a Python-based data lake using MinIO, PyIceberg, PyArrow, and Postgres, ideal for small to medium setups due to its simplicity. The step-by-step instructions cover installation of libraries, configuring SQL catalogs, data transformation using Pandas and PyArrow, and querying data. Advanced operations using DuckDB are also explored, showcasing robust data handling with flexibility and scalability.
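
    The guide's stack uses MinIO, PyIceberg, PyArrow, and Postgres; the core idea it builds on, files in a partitioned layout scanned per partition, can be sketched with the standard library alone (directory and field names here are invented for illustration):

    ```python
    import json
    import tempfile
    from pathlib import Path

    # Stdlib-only sketch of a data lake's core layout: Hive-style
    # key=value partition directories, scanned one partition at a time.
    # A real setup would store Parquet in MinIO via PyArrow/PyIceberg.
    lake = Path(tempfile.mkdtemp()) / "sales"

    def write_partition(day: str, records: list[dict]) -> None:
        part_dir = lake / f"date={day}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "part-0.jsonl", "w") as f:
            for rec in records:
                f.write(json.dumps(rec) + "\n")

    def scan(day: str) -> list[dict]:
        # Partition pruning: only files under the matching directory are read.
        rows = []
        for path in sorted((lake / f"date={day}").glob("*.jsonl")):
            with open(path) as f:
                rows.extend(json.loads(line) for line in f)
        return rows

    write_partition("2024-01-01", [{"sku": "a", "qty": 2}, {"sku": "b", "qty": 1}])
    write_partition("2024-01-02", [{"sku": "a", "qty": 5}])
    day_one = scan("2024-01-01")
    ```

    Swapping JSON Lines for Parquet and the local filesystem for object storage gives the setup the article describes, without changing the layout idea.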

  7. Article · KDnuggets · 2y

    10 Built-In Python Modules Every Data Engineer Should Know

    Python's standard library includes built-in modules that are essential for data engineering tasks. Key modules like os, pathlib, shutil, csv, json, pickle, sqlite3, datetime, re, and subprocess enable efficient file and directory management, data handling and serialization, database interaction, text processing, and more. Utilizing these modules can streamline your data engineering workflows, providing essential functionality without relying on external libraries.
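
    A small CSV-to-JSON conversion shows several of these modules working together (the file names and rows are made up):

    ```python
    import csv
    import json
    import tempfile
    from datetime import datetime
    from pathlib import Path

    workdir = Path(tempfile.mkdtemp())

    # csv: write delimited data with a header row.
    src = workdir / "events.csv"
    with src.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["event", "day"])
        writer.writerow(["signup", "2024-03-01"])
        writer.writerow(["login", "2024-03-02"])

    # pathlib + csv + datetime: read rows back and parse the date column.
    with src.open(newline="") as f:
        rows = [
            {"event": r["event"], "day": datetime.strptime(r["day"], "%Y-%m-%d").date()}
            for r in csv.DictReader(f)
        ]

    # json: serialize, converting dates back to strings (JSON has no date type).
    dst = workdir / "events.json"
    dst.write_text(json.dumps([{**r, "day": r["day"].isoformat()} for r in rows]))
    loaded = json.loads(dst.read_text())
    ```

    The other modules on the list fill similar gaps: shutil for copying and archiving, sqlite3 for an embedded database, re for text matching, and subprocess for shelling out.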

  8. Article · KDnuggets · 2y

    10 GitHub Repositories to Master Data Engineering

    Learn data engineering through free courses, tutorials, books, tools, guides, roadmaps, practice exercises, projects, and other resources.

  9. Article · Towards Dev · 2y

    Building a Serverless Data Pipeline: A Step-by-Step Guide

    The guide provides step-by-step instructions to build a serverless data pipeline using AWS services. Key components include AWS Lambda for data extraction from the Colombo Stock Market Index API, Amazon Kinesis Data Firehose for data ingestion, Amazon S3 for storage, and AWS Glue for ETL orchestration with Athena for querying data. The pipeline uses event-driven architectures with SQS notifications and Glue crawlers for efficient data processing.

  10. Article · SwirlAI · 1y

    What is AI Engineering?

    AI Engineering is a rapidly evolving role focused on developing and deploying AI systems that utilize Large Language Models (LLMs) to solve business problems. AI Engineers differ from Software Engineers and Machine Learning Engineers in that they deal extensively with non-deterministic systems and require skills in prompt engineering, infrastructure, and data integration. The field is witnessing the rise of Agentic systems, which are advanced AI systems capable of performing complex tasks with a degree of autonomy. AI Engineering is poised to become one of the most in-demand roles in the tech industry with high salaries and growing opportunities.

  11. Article · KDnuggets · 2y

    Project Ideas to Master Data Engineering

    To effectively learn data engineering, working on projects is essential. Key skills to focus on include data transformation, data visualization, building data pipelines, and implementing data storage solutions like data lakes and data warehouses. The post suggests six project ideas to cover these aspects: building an end-to-end data pipeline, transforming data sets, implementing a data lake, creating a data warehouse, processing real-time data, and visualizing data with dashboards.

  12. Article · Machine Learning News · 1y

    Meet MegaParse: An Open-Source AI Tool for Parsing Various Types of Documents for LLM Ingestion

    MegaParse is an open-source tool designed to efficiently parse various types of documents (PDF, Word, Excel, CSV, etc.) for ingestion into large language models (LLMs). It saves users significant time and effort by automating the conversion process while retaining information integrity. The tool is highly versatile, handling different document elements such as tables and images, and supports customizable output formats. Installation is straightforward via pip, with additional setups for dependencies like Poppler, Tesseract, and libmagic. MegaParse also provides advanced usage options and benchmarking capabilities, making it a reliable choice for developers and enterprises looking to streamline their AI data pipeline.

  13. Article · ByteByteGo · 2y

    Trillions of Indexes: How Uber’s LedgerStore Supports Such Massive Scale

    Uber's LedgerStore is a custom-built solution to manage trillions of financial transaction records efficiently. It ensures data immutability and supports various types of indexes including strongly consistent, eventually consistent, and time-range indexes. The migration from DynamoDB to LedgerStore for Uber's payment data was driven by the need for cost savings, simplified architecture, improved performance, and tailored features for financial data management. This transition involved handling 1.2 PB of compressed data with zero data inconsistencies detected over six months.

  14. Article · KDnuggets · 2y

    7 Python Libraries Every Data Engineer Should Know

    Discover some essential Python libraries for data engineers, including Requests for API data extraction, BeautifulSoup for web scraping, Pandas for data manipulation, SQLAlchemy for database work, Airflow for workflow orchestration, PySpark for big data processing, and Kafka-Python for real-time data processing.

  15. Article · Medium · 2y

    The Most Important Soft Skill in Tech

    Interviewing is the most important soft skill in tech. Building rapport, demonstrating value, and projecting professional growth are key objectives in an interview.

  16. Article · Community Picks · 2y

    Top 5 Data Engineering Projects for Beginners 2024

    Get hands-on experience with five beginner-friendly data engineering projects that enhance skills in data processing, analytics, and visualization using tools like Apache Kafka, Python, and cloud platforms. These projects prepare you for real-world challenges and make your resume stand out.

  17. Article · Machine Learning News · 2y

    OmniParse: An AI Platform that Ingests/Parses Any Unstructured Data into Structured, Actionable Data Optimized for GenAI (LLM) Applications

    OmniParse is an AI platform designed to convert various unstructured data types, including documents, images, audio, video, and web content, into structured, actionable data. It supports around 20 different file types and operates entirely locally, ensuring data privacy. OmniParse deploys easily using Docker and Skypilot and works with platforms like Colab. It uses advanced models such as Surya OCR and Whisper, achieving high accuracy and efficiency in data conversion, optimizing it for Generative AI applications.

  18. Article · KDnuggets · 2y

    5 Free Online Courses to Learn Data Engineering Fundamentals

    Explore five free online courses designed to teach the fundamentals of data engineering. These courses range from beginner-friendly introductions to comprehensive professional certificates. Key areas covered include data pipelines, databases, Python and Pandas, cloud computing, and big data tools like Hadoop and Apache Spark.

  19. Article · Machine Learning News · 2y

    Top Data Engineering Courses in 2024

    Data engineering is crucial for organizations relying on data-driven insights. This post lists top courses for mastering data engineering skills such as building scalable data solutions, ETL processes, and leveraging technologies like Apache Spark and cloud platforms. Courses include IBM’s Data Engineering Foundations, Meta Database Engineer Professional Certificate, and Google Cloud Database Engineer Specialization, among others.

  20. Article · Data Engineer Things · 1y

    ETL and ELT

    The author reflects on their journey from chasing the latest data engineering tools to focusing on foundational concepts, emphasizing the shift from ETL (Extract, Transform, Load) to ELT (Extract, Load, Transform). The traditional ETL process, necessitated by the high costs and limitations of early data warehouses, is contrasted with the modern ELT approach, facilitated by advancements in cloud data warehousing. ELT offers greater flexibility and efficiency by loading raw data into the warehouse and handling transformations within the warehouse, aligning better with agile development practices.
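
    The ELT idea can be sketched with sqlite3 standing in for a cloud warehouse: raw data lands untouched in a staging table, and the transformation happens inside the warehouse afterward. Table and column names here are illustrative, not from the article:

    ```python
    import sqlite3

    wh = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse

    # Extract + Load: raw records land in a staging table with no upfront cleaning.
    wh.execute("CREATE TABLE raw_orders (order_id TEXT, amount_cents TEXT)")
    wh.executemany(
        "INSERT INTO raw_orders VALUES (?, ?)",
        [("A1", "1050"), ("A2", "299"), ("A3", "1050")],
    )

    # Transform: modeling happens in-warehouse, after loading --
    # here a cleaned view casting text cents to dollar amounts.
    wh.execute("""
        CREATE VIEW orders AS
        SELECT order_id, CAST(amount_cents AS INTEGER) / 100.0 AS amount_usd
        FROM raw_orders
    """)

    total = wh.execute("SELECT SUM(amount_usd) FROM orders").fetchone()[0]
    ```

    Because the raw table is preserved, the transformation can be revised and re-run at any time, which is the agility the ELT pattern buys.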

  21. Article · Data Engineer Things · 1y

    PoC Data Platform project utilizing modern data stack (Airflow, Spark, DBT, Trino, Hive metastore, Lightdash, Delta Lake)

    The PoC Data Platform demonstrates extracting, loading, and transforming data using modern data technologies like Airflow, Spark, DBT, Trino, Hive Metastore, Lightdash, and Delta Lake. It utilizes AdventureWorks data within a data lake environment and offers insights into configuring these tools for data engineering and system design. The platform provides a comprehensive Docker setup with detailed instructions, making it a valuable resource for both beginners and professionals in data systems.

  22. Article · KDnuggets · 2y

    Landing a Data Engineer Role: Free Courses and Certifications

    Training for a data engineer role doesn't have to be expensive. A curated list of 10 free data engineering courses offers quality education at no cost. Courses cover key areas such as SQL, Python, cloud data engineering, ETL and data pipelines, data warehousing, and Apache Spark. Many courses are provided by edX, and some require prior knowledge of SQL and relational databases. The article argues that, with dedication and persistence, these free resources are enough to reach your data engineering goals.

  23. Article · Towards AI · 2y

    SQL Interview Problem — Solution.

    The post provides a step-by-step solution to an SQL interview problem: find the employee-manager pair with the second-highest average salary. It details how to read the expected output, identify the employee-manager pairing condition, use a self-join to fetch both rows, compute the average salaries, and rank the results to filter for the needed pair.
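
    The approach described (self-join, per-pair average, ranking) can be sketched with sqlite3; the schema and sample salaries are invented for illustration, and SQLite 3.25+ is assumed for the window function:

    ```python
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE employees (id INTEGER, name TEXT, salary REAL, manager_id INTEGER)")
    db.executemany(
        "INSERT INTO employees VALUES (?, ?, ?, ?)",
        [
            (1, "dana", 200.0, None),  # top-level manager, no pair of her own
            (2, "eli", 120.0, 1),
            (3, "fay", 100.0, 1),
            (4, "gus", 90.0, 2),
        ],
    )

    row = db.execute("""
        WITH pair_avg AS (
            -- self-join: each employee row to its manager's row
            SELECT e.name AS employee, m.name AS manager,
                   (e.salary + m.salary) / 2.0 AS avg_salary
            FROM employees e
            JOIN employees m ON e.manager_id = m.id
        ),
        ranked AS (
            SELECT *, DENSE_RANK() OVER (ORDER BY avg_salary DESC) AS rnk
            FROM pair_avg
        )
        SELECT employee, manager, avg_salary FROM ranked WHERE rnk = 2
    """).fetchone()
    ```

    With these numbers the pair averages are eli-dana 160, fay-dana 150, and gus-eli 105, so rank 2 is the fay-dana pair.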

  24. Article · Community Picks · 2y

    Bridging Backend and Data Engineering: Communicating Through Events

    In modern software development, seamless communication between backend services and data engineering pipelines is crucial. Traditional methods like REST APIs and batch processing often fall short for real-time demands. An event-driven architecture (EDA) offers a solution by using asynchronous event communication, enabling integration of diverse systems. A practical approach is setting up a Pub/Sub system where services broadcast and consume events via standardized formats. This method allows for selective event subscription and facilitates efficient asynchronous communication without overhauling the infrastructure.
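
    A minimal in-memory sketch of the Pub/Sub pattern described, with a dict of callbacks standing in for a real broker such as Kafka or Google Pub/Sub (class and topic names are invented):

    ```python
    from collections import defaultdict
    from typing import Callable

    class PubSub:
        """Toy event bus: topics map to subscriber callbacks."""

        def __init__(self) -> None:
            self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

        def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
            self._subscribers[topic].append(handler)

        def publish(self, topic: str, event: dict) -> None:
            # Only handlers subscribed to this topic receive the event.
            for handler in self._subscribers[topic]:
                handler(event)

    bus = PubSub()
    seen: list[dict] = []

    # The data pipeline subscribes selectively, only to events it cares about.
    bus.subscribe("order.created", seen.append)

    # Backend services broadcast events in a standardized envelope.
    bus.publish("order.created", {"type": "order.created", "order_id": "A1"})
    bus.publish("user.deleted", {"type": "user.deleted", "user_id": "U9"})
    ```

    A real broker adds what this sketch omits: durability, delivery across processes, and asynchronous consumption, but the subscribe/publish contract is the same.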

  25. Article · The New Stack · 1y

    5 Python Libraries Every Data Engineer Should Know

    Python is a powerful language for data engineering, enhanced by essential third-party libraries. For beginners, Beautiful Soup 4 and Requests are ideal for web scraping and sending HTTP requests. Intermediate users may benefit from Apache Airflow for workflow automation and Boto3 for integrating AWS services. Advanced users can leverage Pandas for comprehensive data manipulation and analysis.