Best of Data Engineering: February 2025

  1. Article
    SwirlAI

    Data Pipelines in Machine Learning Systems.

    This tutorial walks through implementing a real-time data ingestion pipeline for machine learning systems using FastAPI and Apache Spark. Key steps include writing a FastAPI collector application, fetching data from the internet and pushing it to that application, and processing the data with a Spark ETL pipeline orchestrated by Airflow, all deployed on the Nebius AI Cloud platform. The tutorial emphasizes ensuring data quality and integrity at each stage and shows how to set up Kubernetes clusters for high availability and managed data operations.
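    The data-quality checks the tutorial emphasizes can be sketched as a validation step the collector might apply before forwarding events downstream. This is a minimal stdlib-only sketch; the field names and rules here are hypothetical, not the tutorial's actual schema:

```python
import json
from datetime import datetime

# Hypothetical schema for an incoming event; the tutorial's real payload
# fields are not shown in the summary, so these names are illustrative.
REQUIRED_FIELDS = {"user_id": str, "event_type": str, "timestamp": str}

def validate_event(raw: str) -> dict:
    """Parse a raw JSON payload and enforce basic data-quality rules
    before the collector forwards the event downstream."""
    event = json.loads(raw)  # raises ValueError on malformed JSON
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in event:
            raise ValueError(f"missing required field: {name}")
        if not isinstance(event[name], expected_type):
            raise ValueError(f"field {name} must be {expected_type.__name__}")
    # Reject timestamps that do not parse as ISO 8601.
    datetime.fromisoformat(event["timestamp"])
    return event
```

    Rejecting bad records at the ingestion edge keeps the downstream Spark ETL from silently propagating malformed data.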

  2. Article
    Data Engineer Things

    It's time to try Kestra

    Kestra is presented as an underrated yet powerful workflow orchestrator, boasting a user-friendly UI, YAML-based workflows, comprehensive documentation, and impressive scalability and performance. While it faces challenges such as being relatively new, having a smaller community, and some limitations in advanced features, Kestra’s simplicity and efficiency make it a promising tool for the future of data team workflow orchestration.

  3. Article
    Towards AI

    End-to-End Data Engineering System on Real Data with Kafka, Spark, Airflow, Postgres, and Docker

    The post provides a detailed guide on building an end-to-end data engineering system using Kafka for data streaming, Spark for data transformation, Airflow for orchestration, PostgreSQL for storage, and Docker for setup and deployment. It is structured into two phases: the first focuses on constructing the data pipeline, while the second will cover creating an application to interact with the database using language models. This project is particularly suited for beginners to data engineering, aiming to deepen their practical knowledge of handling data systems.
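    The transformation phase can be sketched, outside Spark, as a pure function over one raw record of the kind the Kafka stream might carry. The nested field names below are hypothetical; the post's actual schema may differ:

```python
from datetime import datetime, timezone

def transform_record(raw: dict) -> dict:
    """Flatten a nested raw record into the row shape a Spark job
    might write to PostgreSQL. Field names are illustrative only."""
    return {
        "full_name": f"{raw['name']['first']} {raw['name']['last']}".title(),
        "country": raw["location"]["country"],
        # Normalize an epoch-seconds timestamp to UTC ISO 8601.
        "registered_at": datetime.fromtimestamp(
            raw["registered"], tz=timezone.utc
        ).isoformat(),
    }
```

    Keeping the transformation a pure function makes it easy to unit-test before wiring it into the Spark job that Airflow schedules.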

  4. Article
    SwirlAI

    Simple way to explain Memory in AI Agents.

    The post explains four types of memory in AI agents: episodic, semantic, procedural, and short-term (working) memory. It also announces that SwirlAI is partnering with NVIDIA to give away an NVIDIA RTX 4080 SUPER GPU: to enter, register for the GTC 2025 conference, which is free and runs March 17-21 both in San Jose, CA and virtually, with sessions on humanoid robots, generative AI for edge applications, and advances in European robotics.
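    The four memory types can be sketched as a small data structure; the class and field names below are illustrative, not an API from the post:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    """A minimal sketch of the four memory types described in the post."""
    episodic: list = field(default_factory=list)    # past interactions, append-only
    semantic: dict = field(default_factory=dict)    # facts the agent knows
    procedural: dict = field(default_factory=dict)  # named skills / routines
    # Short-term (working) memory: only the most recent context survives.
    working: deque = field(default_factory=lambda: deque(maxlen=5))

    def remember_episode(self, event: str) -> None:
        self.episodic.append(event)   # full history is retained
        self.working.append(event)    # oldest entries fall out of the window
```

    The bounded deque captures the key contrast: episodic memory grows without limit, while working memory holds only a sliding window of recent context.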

  5. Article
    neo4j

    LLM Knowledge Graph Builder — First Release of 2025

    The LLM Knowledge Graph Builder enhances retrieval-augmented generation (RAG) by transforming unstructured data into a structured knowledge graph. It imports documents, splits them into chunks, generates text embeddings, and uses various language models to extract entities and their relationships. The first release of 2025 adds community summaries, multiple retrievers running in parallel, guided extraction instructions, expanded model support, and metrics for retriever evaluation, improving user experience and data interaction.
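    The chunking step can be sketched as a fixed-size splitter with overlap; the builder's actual splitter and default parameters may differ:

```python
def split_into_chunks(text: str, chunk_size: int = 200, overlap: int = 40) -> list:
    """Split text into overlapping chunks so that entities spanning a
    boundary still appear whole in at least one chunk."""
    step = chunk_size - overlap  # how far each chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the final chunk already covers the end of the text
    return chunks
```

    Each chunk would then be embedded and passed to a language model for entity and relationship extraction.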

  6. Article
    Tinybird

    Ship data as you ship code: Tinybird is local-first.

    Tinybird is transitioning to a local-first workflow to simplify working with large amounts of real-time data. The new approach allows developers to build, test, validate, and deploy data projects locally before pushing changes to the cloud. Key features include local project validation, seamless CI/CD integration, live schema migrations, and AI-powered IDE support. The beta version will be available soon.

  7. Article
    Debezium

    Real-time Data Replication with Debezium and Python

    Change Data Capture (CDC) is essential for replicating operational data for analytics, and Debezium is a leading tool in this space, connecting to various databases and exporting CDC events in formats like JSON and Avro. This post demonstrates how to implement a Python-powered CDC pipeline using Debezium and pydbzengine, capturing change data from PostgreSQL and loading it into DuckDB with the Data Load Tool (DLT). The guide includes a code walkthrough, from setting up the environment and configuring Debezium to executing the pipeline and querying the results in DuckDB.
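    The shape of the CDC events flowing through such a pipeline can be sketched with a stdlib-only handler for Debezium's JSON envelope (a `payload` carrying `op`, `before`, and `after`); the simplification of treating deletes as `None` is this sketch's choice, not the post's:

```python
import json
from typing import Optional

def change_event_to_row(raw: str) -> Optional[dict]:
    """Extract the row to load from a Debezium-style JSON change event.
    'c' = create, 'u' = update, 'r' = snapshot read: load the "after" image.
    Deletes ('d') return None here for simplicity."""
    payload = json.loads(raw).get("payload", {})
    if payload.get("op") in ("c", "u", "r"):
        return payload["after"]
    return None
```

    In the post's pipeline, rows like these are handed to DLT, which batches them into DuckDB tables for querying.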

  8. Article
    Data Engineer Things

    Data Formats and Compression in Data Engineering: Best Practices for CSV, Excel, JSON, Parquet, and Avro

    Choosing the right file format and compression strategy is crucial in data engineering to optimize storage and data transfer speeds. Common data formats include CSV, Excel, JSON, Parquet, and Avro, each with their own pros and cons. Various compression types such as GZIP, BZIP2, Snappy, LZO, and built-in options for formats like Parquet are discussed, alongside their best use cases. The timing of compression—whether during or after data transformation—also significantly impacts efficiency and storage. Practical recommendations include assessing use cases, optimizing at multiple stages, and continuously testing and monitoring the performance impacts of different strategies.
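    The article's advice to test and monitor compression strategies can be sketched with the standard library; the sample data below is synthetic, and real ratios depend heavily on the workload:

```python
import bz2
import gzip

# Synthetic, highly repetitive CSV-like sample; benchmark your own data,
# as the article recommends, before committing to a codec.
data = b"user_id,event,timestamp\n" + b"42,click,2025-02-01T12:00:00\n" * 500

sizes = {
    "raw": len(data),
    "gzip": len(gzip.compress(data)),  # good general-purpose ratio and speed
    "bz2": len(bz2.compress(data)),    # often smaller output, slower to produce
}
```

    A loop like this over candidate codecs and representative files is a simple way to make the compression choice an measured decision rather than a default.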