Best of Data Engineering: November 2024

  1. Article · SwirlAI

    What is AI Engineering?

    AI Engineering is a rapidly evolving role focused on developing and deploying AI systems that use Large Language Models (LLMs) to solve business problems. AI Engineers differ from Software Engineers and Machine Learning Engineers in that they deal extensively with non-deterministic systems and need skills in prompt engineering, infrastructure, and data integration. The field is witnessing the rise of Agentic systems: advanced AI systems capable of performing complex tasks with a degree of autonomy. AI Engineering is poised to become one of the most in-demand roles in the tech industry, with high salaries and growing opportunities.
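    One recurring pattern behind "dealing with non-deterministic systems" is wrapping every model call in a validate-and-retry loop. A minimal sketch (the `call_llm` stub and its canned responses are invented here purely so the example runs; a real call would hit an LLM API):

    ```python
    import json

    def call_llm(prompt: str, attempt: int) -> str:
        """Stand-in for a real LLM call. Responses are hard-coded so the
        sketch is self-contained; real outputs are non-deterministic."""
        return '{"sentiment": "positive"}' if attempt > 0 else "Sure! Here you go:"

    def extract_json(prompt: str, max_retries: int = 3) -> dict:
        """Retry until the model returns parseable JSON -- a common
        guardrail when building on unreliable LLM outputs."""
        for attempt in range(max_retries):
            raw = call_llm(prompt, attempt)
            try:
                return json.loads(raw)
            except json.JSONDecodeError:
                continue  # malformed output: ask again
        raise ValueError("model never returned valid JSON")

    result = extract_json("Classify the sentiment of: 'Great product!'")
    print(result)  # {'sentiment': 'positive'}
    ```

    Production variants add schema validation and prompt adjustments between retries, but the control flow is the same.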

  2. Article · Netflix TechBlog

    Netflix’s Distributed Counter Abstraction

    Netflix's Distributed Counter Abstraction is a high-performance, scalable counting service built on top of their TimeSeries Abstraction. It supports two primary counting modes—Best-Effort and Eventually Consistent—to cater to different use cases and trade-offs involving accuracy, latency, and infrastructure costs. The service achieves high throughput and availability through a combination of caching, durable queuing, and periodic aggregation. It also incorporates various approaches to manage the data loss, idempotency, and contention issues inherent in distributed systems.
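    The Eventually Consistent mode can be pictured as a two-stage pipeline: cheap appends to a durable log, with a background job folding the log into rollups. A toy in-memory sketch (class and method names are illustrative, not Netflix's actual API):

    ```python
    from collections import defaultdict, deque

    class EventuallyConsistentCounter:
        """Toy model of log-then-aggregate counting: increments append to
        a durable queue and a periodic pass folds them into rollups,
        trading read freshness for write throughput."""

        def __init__(self):
            self.log = deque()              # stands in for the durable queue
            self.rollup = defaultdict(int)  # aggregated count per counter

        def increment(self, counter: str, delta: int = 1) -> None:
            self.log.append((counter, delta))  # cheap, contention-free write

        def aggregate(self) -> None:
            """Periodic background job: drain the log into the rollup."""
            while self.log:
                counter, delta = self.log.popleft()
                self.rollup[counter] += delta

        def get(self, counter: str) -> int:
            # Reads see the last rollup and may lag recent increments.
            return self.rollup[counter]

    c = EventuallyConsistentCounter()
    for _ in range(5):
        c.increment("views")
    print(c.get("views"))  # 0 -- aggregation hasn't run yet
    c.aggregate()
    print(c.get("views"))  # 5
    ```

    The real service distributes both stages and must also handle idempotency and re-aggregation, but the accuracy-versus-latency trade-off is visible even in this toy.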

  3. Article · ByteByteGo

    Storing 200 Billion Entities: Notion’s Data Lake Project

    Notion's data has grown more than tenfold since 2021, reaching over 200 billion blocks stored in their Postgres database by 2024. This exponential growth led to the development of a new data lake infrastructure to manage the heavy load and improve scalability, performance, and cost-efficiency. The new setup uses S3 for storage, Kafka for data ingestion, and Apache Hudi for managing updates. The overhaul has resulted in significant cost savings, reduced ingestion times, and enabled new features such as Notion AI.
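    The reason a format like Apache Hudi matters here is upsert semantics: incoming Kafka batches of block changes can be merged by key instead of rewriting the whole dataset. A conceptual in-memory sketch (this is not the Hudi API, and the `block_id` records are invented for illustration):

    ```python
    def upsert(snapshot: dict, updates: list) -> dict:
        """Merge a batch of change records into a keyed snapshot,
        last-writer-wins per key -- the semantics a table format like
        Hudi provides over object storage such as S3."""
        merged = dict(snapshot)
        for record in updates:
            merged[record["block_id"]] = record  # update or insert by key
        return merged

    snapshot = {"b1": {"block_id": "b1", "text": "hello"}}
    batch = [
        {"block_id": "b1", "text": "hello, world"},  # update existing block
        {"block_id": "b2", "text": "new block"},     # insert new block
    ]
    merged = upsert(snapshot, batch)
    print(len(merged))  # 2
    ```

    At 200 billion blocks the hard part is doing this incrementally and cheaply on S3, which is exactly what the table format handles.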

  4. Article · Data Engineer Things

    How does Netflix ensure the data quality for thousands of Apache Iceberg tables?

    Netflix employs the Write-Audit-Publish (WAP) pattern using Apache Iceberg to maintain high data quality across thousands of tables. The WAP pattern involves writing data to a hidden snapshot, auditing it, and publishing it only if it passes quality checks. This approach is analogous to CI/CD workflows, ensuring validated data is exposed to downstream consumers. Apache Iceberg's structure, including manifest files, metadata files, and catalog, supports efficient snapshot management and branching, similar to version control in Git.
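    The Write-Audit-Publish flow reduces to three steps: write to a hidden snapshot, run quality checks against it, and swap it in only on success. A minimal sketch of that control flow (illustrative only; Iceberg implements it with branches and snapshot metadata, not this class):

    ```python
    class WapTable:
        """Toy Write-Audit-Publish: writes land on a hidden staging
        snapshot, and only a snapshot that passes its audit becomes
        the version consumers see."""

        def __init__(self):
            self.current = []    # snapshot visible to downstream consumers
            self.staging = None  # hidden snapshot awaiting audit

        def write(self, rows):
            self.staging = self.current + rows  # write without exposing

        def audit(self, check) -> bool:
            return self.staging is not None and all(check(r) for r in self.staging)

        def publish(self, check):
            if not self.audit(check):
                raise ValueError("audit failed: snapshot not published")
            self.current, self.staging = self.staging, None

    t = WapTable()
    t.write([{"id": 1, "amount": 10}])
    t.publish(lambda r: r["amount"] >= 0)  # audit passes, snapshot goes live
    print(len(t.current))  # 1
    ```

    The CI/CD analogy in the article maps directly: `write` is the build, `audit` is the test suite, and `publish` is the deploy.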

  5. Article · Data Engineer Things

    Excel Isn’t Going Anywhere, So Let’s Automate Parsing It

    Automating Excel file parsing with Python and Pandas can significantly improve efficiency, consistency, and scalability in handling messy, manually filled Excel files. This guide provides a step-by-step process to read and extract specific table data, handle common formatting problems, and alert stakeholders about any issues encountered during parsing.
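    The core loop in such a pipeline is: normalize headers, drop junk rows, coerce types, and collect every problem into an alert. A sketch of that flow (the frame is built inline so the example is self-contained; in practice it would come from `pd.read_excel("report.xlsx", ...)`, and the column names are invented):

    ```python
    import pandas as pd

    # Stand-in for pd.read_excel(...) on a messy, manually filled sheet.
    raw = pd.DataFrame({
        "Name ": ["Alice", None, "Bob"],   # note the stray trailing space
        "Amount": ["100", "", "abc"],
    })

    issues = []

    # Normalize headers that humans typed by hand.
    df = raw.rename(columns=lambda c: c.strip().lower())

    # Drop rows with no name; record how many were lost.
    before = len(df)
    df = df.dropna(subset=["name"])
    if len(df) < before:
        issues.append(f"dropped {before - len(df)} row(s) with missing name")

    # Coerce amounts, flagging values that don't parse as numbers.
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    bad = int(df["amount"].isna().sum())
    if bad:
        issues.append(f"{bad} unparseable amount value(s)")

    print(issues)  # the list a stakeholder alert would be built from
    ```

    Keeping the issue list separate from the cleaned frame is the key design choice: the pipeline stays unblocked while humans still hear about every anomaly.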

  6. Article · Data Engineer Things

    Cloud-Native Data Engineering: Orchestrating Spark on Kubernetes with Custom Airflow Operator and GCS Integration

    This guide provides step-by-step instructions for setting up a scalable, automated data pipeline using Spark on Kubernetes with Google Cloud Storage (GCS) integration, managed by Apache Airflow. It includes configuring custom Docker images for Spark with GCS support, installing and configuring the Spark Operator and Airflow, creating a custom Airflow operator for submitting Spark jobs, and setting up necessary role-based access controls. By the end, readers will have a robust, cloud-native data engineering platform capable of handling complex data workflows efficiently.
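    The piece the Airflow operator ultimately submits is a `SparkApplication` custom resource for the Spark Operator. A hedged sketch of what such a manifest looks like (image name, bucket, namespace, and service account are placeholders, not values from the article):

    ```yaml
    # Illustrative SparkApplication for the Kubernetes Spark Operator,
    # using a custom image that bundles the GCS connector.
    apiVersion: sparkoperator.k8s.io/v1beta2
    kind: SparkApplication
    metadata:
      name: gcs-etl-job
      namespace: spark-jobs
    spec:
      type: Python
      mode: cluster
      image: my-registry/spark-gcs:3.5.0      # custom Docker image with GCS support
      mainApplicationFile: gs://my-bucket/jobs/etl.py
      sparkVersion: "3.5.0"
      hadoopConf:
        fs.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
      driver:
        cores: 1
        memory: 2g
        serviceAccount: spark-sa              # bound via the RBAC setup from the guide
      executor:
        instances: 2
        cores: 2
        memory: 4g
    ```

    The custom Airflow operator's job is then mostly to render and apply a manifest like this, and to poll the resource's status until the Spark job finishes.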

  7. Article · Agoda

    A Day in the Life of a Data Engineer at Agoda

    At Agoda, data fuels every decision, and data engineers play a key role by designing and maintaining data pipelines. Lookuut Struchkov, a Staff Data Engineer, discusses his journey and daily responsibilities, including optimizing data pipelines, collaborating with various teams, and handling on-call support requests. He emphasizes the importance of continuous learning and shares advice for aspiring data engineers. Agoda stands out for its commitment to innovation, use of technologies like Spark and Scala, and its collaborative culture.

  8. Article · Materialized View

    It's Time to Merge Analytics and Data Engineering (Again)

    The post argues for merging analytics and data engineering roles, citing the commoditization of data pipelines and the limited value provided by distinct analytics engineers. With advancements like LLMs, data integration tools, and data pipeline vendors, there's a push for a consolidated data team handling extraction, transformation, and loading (ETL) processes. The author notes emerging tools that facilitate this transition and predicts a convergence of these roles in the coming years.

  9. Article · Crunchy Data

    8 Steps in Writing Analytical SQL Queries

    Writing complex SQL queries means starting with simple queries and progressively adding complexity while verifying accuracy at each step. Key steps include defining the desired output, investigating and sampling the data, keeping each building block simple, adding joins cautiously, performing summations last, and rigorously debugging. SQL's power lies in its ability to combine simple, standardized logic blocks to extract accurate data from complex structures.
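    That stepwise discipline is easy to demonstrate end to end with SQLite (the two-table schema here is invented for illustration): sample first, verify the join didn't fan out, and only then aggregate.

    ```python
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE orders(id INTEGER, customer_id INTEGER, amount REAL);
        CREATE TABLE customers(id INTEGER, region TEXT);
        INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 15.0), (3, 2, 7.5);
        INSERT INTO customers VALUES (1, 'EMEA'), (2, 'APAC');
    """)

    # Step 1: sample the data to confirm what's actually in it.
    sample = con.execute("SELECT * FROM orders LIMIT 5").fetchall()

    # Step 2: add the join cautiously; check the row count didn't explode.
    joined = con.execute("""
        SELECT o.amount, c.region
        FROM orders o JOIN customers c ON o.customer_id = c.id
    """).fetchall()
    assert len(joined) == 3  # same as orders: the join didn't fan out

    # Step 3: only now perform the summation.
    totals = con.execute("""
        SELECT c.region, SUM(o.amount)
        FROM orders o JOIN customers c ON o.customer_id = c.id
        GROUP BY c.region ORDER BY c.region
    """).fetchall()
    print(totals)  # [('APAC', 7.5), ('EMEA', 25.0)]
    ```

    The intermediate assertion is the whole point: each added clause is checked before the next one obscures it.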

  10. Video · YouTube

    6-week Free Data Engineering Boot Camp Launch Video | DataExpert.io

    A six-week free Data Engineering Boot Camp is launching, featuring over 45 in-depth videos aimed at elevating skills in data engineering. The first two weeks cover dimensional and fact data modeling. Afterward, the boot camp splits into infrastructure and analytics tracks, addressing topics such as pipeline specs, data quality patterns, PySpark unit testing, Kafka, real-time data processing, and more. Participants can earn certificates and get hands-on assignments with AI-generated feedback, with content published daily from November 15th until the end of the year.

  11. Article · Data Engineer Things

    Understanding Data Products and Data Contracts

    Data products and data contracts are essential tools for transforming raw data into valuable assets. Data products are curated datasets crafted to solve specific business problems, while data contracts are formal agreements ensuring data quality and reliability between producers and consumers. These concepts help organizations manage data efficiently, foster trust, and drive innovation by defining clear standards and processes for data handling and access control.
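    In practice, a data contract is something machine-checkable that the producer runs before consumers ever see the data. A minimal sketch (the `users` contract, field names, and owner are hypothetical; real teams often express this as JSON Schema or with dedicated contract tooling):

    ```python
    # Hypothetical contract for a 'users' data product.
    CONTRACT = {
        "required_fields": {"user_id": int, "email": str},
        "owner": "identity-team",
    }

    def validate(record: dict, contract: dict) -> list:
        """Return the contract violations for one record -- the
        producer-side check that keeps consumer trust."""
        violations = []
        for field, ftype in contract["required_fields"].items():
            if field not in record:
                violations.append(f"missing field: {field}")
            elif not isinstance(record[field], ftype):
                violations.append(f"{field} should be {ftype.__name__}")
        return violations

    print(validate({"user_id": 1, "email": "a@b.com"}, CONTRACT))  # []
    print(validate({"user_id": "1"}, CONTRACT))
    # ['user_id should be int', 'missing field: email']
    ```

    Wiring checks like this into the producer's pipeline is what turns the contract from a document into an enforced agreement.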

  12. Article · Tech World With Milan

    The Trends #5: 25% of new code is generated by AI

    AI technologies are increasingly integrated into workflows, with over 25% of new code at Google being generated by AI. Organizations face challenges in cloud cost management, including high idle resource costs and underutilization of cloud discounts. APIs are evolving as strategic revenue drivers, and there is a shift towards open-source AI models, AI-integrated hardware, and small language models. New programming languages and platforms like Rust and WASM are gaining traction. The ThoughtWorks Technology Radar highlights the rise of generative AI tools and common coding assistance anti-patterns.