Best of Data Engineering, May 2025

  1. Article
    Materialized View · 46w

    Kafka: The End of the Beginning

    Apache Kafka has dominated streaming data for over a decade, but innovation has stagnated while batch processing has evolved rapidly. The streaming ecosystem faces slow growth, long sales cycles, and a lack of new ideas. While Kafka's protocol has become the de facto standard, its architecture shows its limits against modern cloud-native requirements. New systems such as S2 are emerging with fresh approaches, and the next decade could see a transition similar to how batch processing moved beyond Hadoop, potentially ushering in a truly cloud-native streaming era.

  2. Article
    Databricks · 46w

    Introducing Apache Spark 4.0

    Apache Spark 4.0 introduces key advancements in SQL language, Python support, structured streaming, and usability, enhancing big data processing. Notable features include improved multi-language compatibility, new SQL scripting capabilities, enhanced Python APIs, and structured logging. This release offers greater modularity, scalability, and standards compliance, making it future-ready for large-scale data analytics.

  3. Article
    Data Engineer Things · 48w

    Airflow 3 and Airflow AI SDK in Action — Analyzing League of Legends

    This post demonstrates how to create an end-to-end data pipeline using Airflow 3 and the Airflow AI SDK to analyze League of Legends data. It covers setting up the environment, exploring the Riot Games API, building a Python client for API interaction, and using AI to generate a champion tier list. The pipeline showcases modern Airflow features like Dynamic Task Mapping and highlights Airflow's newer integration capabilities with Large Language Models.
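    The Python-client step described above can be sketched roughly as follows. The `RiotClient` class name, the default region, and the helper structure are illustrative assumptions, not the post's actual code; the `account-v1` endpoint path and the `X-Riot-Token` header are from the public Riot Games API.

```python
from urllib.parse import quote
from urllib.request import Request

class RiotClient:
    """Minimal sketch of a Riot Games API client (illustrative, not the post's code)."""

    def __init__(self, api_key: str, region: str = "europe"):
        self.api_key = api_key
        self.base_url = f"https://{region}.api.riotgames.com"

    def build_request(self, path: str) -> Request:
        # The API key is sent in the X-Riot-Token request header.
        return Request(self.base_url + path,
                       headers={"X-Riot-Token": self.api_key})

    def account_by_riot_id(self, game_name: str, tag_line: str) -> Request:
        # account-v1 endpoint: look up an account by Riot ID (gameName#tagLine).
        path = (f"/riot/account/v1/accounts/by-riot-id/"
                f"{quote(game_name)}/{quote(tag_line)}")
        return self.build_request(path)
```

    In an Airflow 3 DAG, calls like this would typically sit inside a task, with Dynamic Task Mapping fanning one such request out per player or match ID.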

  4. Article
    Data Engineer Things · 46w

    From GIS to Data Engineering: Mastering Docker Fundamentals and Best Practices

    The post details a geospatial professional's transition into data engineering by mastering Docker fundamentals and best practices. It covers key aspects such as Docker setup, container security, resource management, and the use of Docker Compose for production-ready environments. It also highlights the importance of secure configuration and iteration in system design, using real-world examples of data pipeline implementation and containerization strategies.
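    A production-leaning Compose file of the kind the post describes might look like the following minimal sketch. The service names, images, and limits are illustrative assumptions, not the post's actual configuration; they simply exercise the practices mentioned (non-root user, secrets, health checks, resource caps).

```yaml
services:
  pipeline:
    build: .
    # Run as a non-root user inside the container (security best practice).
    user: "1000:1000"
    environment:
      - DB_HOST=db
    depends_on:
      db:
        condition: service_healthy
    # Cap resources so a runaway job cannot starve the host.
    deploy:
      resources:
        limits:
          cpus: "1.0"
          memory: 512M

  db:
    image: postgis/postgis:16-3.4
    environment:
      # Read the password from a secret file rather than the environment.
      - POSTGRES_PASSWORD_FILE=/run/secrets/db_password
    secrets:
      - db_password
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      retries: 5

secrets:
  db_password:
    file: ./secrets/db_password.txt
```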

  5. Article
    SingleStore · 47w

    5 Signs Your PostgreSQL Database Is Hitting Its Performance Limits

    PostgreSQL is a powerful relational database system, but it can hit performance limits as workloads grow. Common warning signs include slow query performance, lock contention, struggles with high-volume data ingestion, the need to archive data frequently, and diminishing returns from hardware upgrades. SingleStore offers a modern, distributed architecture that enhances real-time analytics, reduces lock contention, supports high-throughput data ingestion, handles large data volumes efficiently, and scales horizontally for better performance and cost efficiency.
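    As one way to check for the slow-query symptom, the standard `pg_stat_statements` extension can rank statements by execution time. This is a generic diagnostic query, not one taken from the article (column names are from PostgreSQL 13+):

```sql
-- Requires: CREATE EXTENSION pg_stat_statements;
-- Top 10 statements by mean execution time.
SELECT query,
       calls,
       mean_exec_time,
       total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```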

  6. Article
    Data Engineer Things · 46w

    Building ETL pipeline using Google Cloud Storage

    The post is a beginner-friendly guide to building a simple ETL pipeline that processes Zomato restaurant data from Kaggle: the data is extracted, transformed, and loaded using Python and Google Cloud Storage. Suggested next steps include automation, extension to other cloud services, dashboarding, and data validation.
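    The transform stage of such a pipeline can be sketched in plain Python. The column names and cleaning rules below are illustrative assumptions about the restaurant dataset, not the post's actual code:

```python
import csv
import io

def transform_rows(raw_csv: str) -> list[dict]:
    """Clean restaurant records: drop rows without a name and
    normalize the rating to a float (illustrative rules)."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        name = row.get("name", "").strip()
        if not name:
            continue  # skip unusable records
        try:
            rating = float(row.get("rating", ""))
        except ValueError:
            rating = None  # keep the row, but flag the missing rating
        cleaned.append({
            "name": name,
            "city": row.get("city", "").strip(),
            "rating": rating,
        })
    return cleaned
```

    In the post's setup, the extract step would download the raw file from a Cloud Storage bucket and the load step would upload the cleaned output back, typically via the `google-cloud-storage` client library.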