Best of Data Engineering, May 2025

  1. Article
    Materialized View · 46w

    Kafka: The End of the Beginning

    Apache Kafka has dominated streaming data for over a decade, but innovation has stagnated while batch processing has evolved rapidly. The streaming ecosystem faces slow growth, long sales cycles, and a lack of new ideas. While Kafka's protocol has become the de facto standard, its architecture shows its limits against modern cloud-native requirements. New systems such as S2 are emerging with fresh approaches, and the next decade could see a transition similar to how batch processing moved beyond Hadoop, potentially ushering in a truly cloud-native streaming era.

  2. Article
    Databricks · 46w

    Introducing Apache Spark 4.0

    Apache Spark 4.0 introduces key advancements in SQL language, Python support, structured streaming, and usability, enhancing big data processing. Notable features include improved multi-language compatibility, new SQL scripting capabilities, enhanced Python APIs, and structured logging. This release offers greater modularity, scalability, and standards compliance, making it future-ready for large-scale data analytics.

  3. Article
    Data Engineer Things · 48w

    Airflow 3 and Airflow AI SDK in Action — Analyzing League of Legends

    This post demonstrates how to create an end-to-end data pipeline using Airflow 3 and the Airflow AI SDK to analyze League of Legends data. It covers setting up the environment, exploring the Riot Games API, building a Python client for API interaction, and using AI to generate a champion tier list. The pipeline showcases modern Airflow features like Dynamic Task Mapping and highlights Airflow's newer integration capabilities with Large Language Models.
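    The Python-client step described above can be sketched roughly as follows. The `RiotClient` class name, the default region, and the helper structure are illustrative assumptions, not the post's actual code; the `account-v1` endpoint path and the `X-Riot-Token` header are from the public Riot Games API.

```python
from urllib.parse import quote
from urllib.request import Request

class RiotClient:
    """Minimal sketch of a Riot Games API client (illustrative, not the post's code)."""

    def __init__(self, api_key: str, region: str = "europe"):
        self.api_key = api_key
        self.base_url = f"https://{region}.api.riotgames.com"

    def build_request(self, path: str) -> Request:
        # The API key is sent in the X-Riot-Token request header.
        return Request(self.base_url + path,
                       headers={"X-Riot-Token": self.api_key})

    def account_by_riot_id(self, game_name: str, tag_line: str) -> Request:
        # account-v1 endpoint: look up an account by Riot ID (gameName#tagLine).
        path = (f"/riot/account/v1/accounts/by-riot-id/"
                f"{quote(game_name)}/{quote(tag_line)}")
        return self.build_request(path)
```

    In an Airflow 3 DAG, calls like this would typically sit inside a task, with Dynamic Task Mapping fanning one such request out per player or match ID.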

  4. Article
    Data Engineer Things · 46w

    From GIS to Data Engineering: Mastering Docker Fundamentals and Best Practices

    The post details a geospatial professional's transition into data engineering by mastering Docker fundamentals and best practices. It covers key aspects such as Docker setup, container security, resource management, and the use of Docker Compose for production-ready environments. It also highlights the importance of secure configuration and iteration in system design, using real-world examples of data pipeline implementation and containerization strategies.
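    A production-leaning Compose file of the kind the post describes might look like the following minimal sketch. The service names, images, and limits are illustrative assumptions, not the post's actual configuration; they simply exercise the practices mentioned (non-root user, secrets, health checks, resource caps).

```yaml
services:
  pipeline:
    build: .
    # Run as a non-root user inside the container (security best practice).
    user: "1000:1000"
    environment:
      - DB_HOST=db
    depends_on:
      db:
        condition: service_healthy
    # Cap resources so a runaway job cannot starve the host.
    deploy:
      resources:
        limits:
          cpus: "1.0"
          memory: 512M

  db:
    image: postgis/postgis:16-3.4
    environment:
      # Read the password from a secret file rather than the environment.
      - POSTGRES_PASSWORD_FILE=/run/secrets/db_password
    secrets:
      - db_password
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      retries: 5

secrets:
  db_password:
    file: ./secrets/db_password.txt
```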

  5. Article
    SingleStore · 47w

    5 Signs Your PostgreSQL Database Is Hitting Its Performance Limits

    PostgreSQL is a powerful relational database system, but it can hit performance limits as workloads grow. Common warning signs include slow query performance, lock contention, struggles with high-volume data ingestion, the need to archive data frequently, and diminishing returns from hardware upgrades. SingleStore offers a modern, distributed architecture that enhances real-time analytics, reduces lock contention, supports high-throughput data ingestion, handles large data volumes efficiently, and scales horizontally for better performance and cost efficiency.
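    As one way to check for the slow-query symptom, the standard `pg_stat_statements` extension can rank statements by execution time. This is a generic diagnostic query, not one taken from the article (column names are from PostgreSQL 13+):

```sql
-- Requires: CREATE EXTENSION pg_stat_statements;
-- Top 10 statements by mean execution time.
SELECT query,
       calls,
       mean_exec_time,
       total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```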

  6. Article
    Data Engineer Things · 46w

    Building ETL pipeline using Google Cloud Storage

    The post is a beginner-friendly guide to building a simple ETL pipeline that processes Zomato restaurant data from Kaggle: the data is extracted, transformed, and loaded using Python and Google Cloud Storage. Suggested next steps include automation, extension to other cloud services, dashboarding, and data validation.
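    The transform stage of such a pipeline can be sketched in plain Python. The column names and cleaning rules below are illustrative assumptions about the restaurant dataset, not the post's actual code:

```python
import csv
import io

def transform_rows(raw_csv: str) -> list[dict]:
    """Clean restaurant records: drop rows without a name and
    normalize the rating to a float (illustrative rules)."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        name = row.get("name", "").strip()
        if not name:
            continue  # skip unusable records
        try:
            rating = float(row.get("rating", ""))
        except ValueError:
            rating = None  # keep the row, but flag the missing rating
        cleaned.append({
            "name": name,
            "city": row.get("city", "").strip(),
            "rating": rating,
        })
    return cleaned
```

    In the post's setup, the extract step would download the raw file from a Cloud Storage bucket and the load step would upload the cleaned output back, typically via the `google-cloud-storage` client library.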