Best of Data Engineering: December 2025

  1. Article
    MotherDuck · 16w

    Stop Paying the Complexity Tax

    Most organizations don't need massive distributed data systems. The industry has over-engineered solutions for edge cases, forcing everyone to pay a complexity tax for scale they'll never require. Modern single-machine databases can handle what previously required distributed systems, with machines now offering 192 cores and 1.5TB of memory. By separating storage (cheap, infinite object storage) from compute (ephemeral, cloneable instances), and designing for the common case of small data with occasional big compute needs, teams can achieve better performance with dramatically simpler architecture. DuckDB exemplifies this approach by focusing on the complete user experience, not just query performance, while MotherDuck extends it with cloud durability and per-user isolation through individual database instances that spin up in under 100ms.
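The storage/compute split described above can be sketched in a few lines. The sketch below uses Python's built-in sqlite3 as a stand-in for an in-process engine such as DuckDB (an assumption for illustration, not MotherDuck's actual implementation): data lives once in shared durable storage, while each query gets its own short-lived, read-only compute instance.

```python
import os
import sqlite3
import tempfile

# Shared, durable storage: one database file standing in for cheap
# object storage. sqlite3 is used only as a stand-in for an
# in-process engine such as DuckDB.
storage = os.path.join(tempfile.mkdtemp(), "shared.db")
with sqlite3.connect(storage) as conn:
    conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO events VALUES (?, ?)",
                     [(1, 9.5), (1, 2.5), (2, 4.0)])
    conn.commit()

def ephemeral_compute(user_id):
    """Spin up a short-lived, per-user compute instance: open a
    read-only connection to shared storage, run the query, tear it
    all down. Storage persists; compute does not."""
    conn = sqlite3.connect(f"file:{storage}?mode=ro", uri=True)
    try:
        (total,) = conn.execute(
            "SELECT SUM(amount) FROM events WHERE user_id = ?", (user_id,)
        ).fetchone()
        return total
    finally:
        conn.close()  # the ephemeral instance is discarded

print(ephemeral_compute(1))  # 12.0
```

Because each "compute instance" holds no state of its own, instances can be created per user, per query, and discarded immediately, which is the isolation model the summary describes.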

  2. Article
    ByteByteGo · 19w

    How Netflix Built a Distributed Write Ahead Log For Its Data Platform

    Netflix built a distributed Write-Ahead Log (WAL) system to solve data reliability issues across their platform. The WAL captures every data change before applying it to databases, enabling automatic retries, cross-region replication, and multi-partition consistency. Built on top of their Data Gateway Infrastructure, it uses Kafka and Amazon SQS as pluggable backends, supports multiple use cases through namespaces, and scales independently through sharded deployments. The system provides durability guarantees while allowing teams to configure retry logic, delays, and targets without code changes.
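The core WAL pattern above — record every change durably before applying it, then retry failed applies from the log — can be illustrated with a toy sketch. This is a conceptual illustration only; Netflix's system layers Kafka/SQS backends, namespaces, and sharded deployments on top of this idea, and all names here are hypothetical.

```python
import json

class WriteAheadLog:
    """Toy WAL: mutations are recorded *before* they are applied,
    so failed applies can be retried from the log."""

    def __init__(self):
        self.log = []     # stand-in for a durable, ordered log (e.g. a Kafka topic)
        self.applied = 0  # index of the next unapplied entry

    def write(self, change: dict):
        # Durability before visibility: log the change first.
        self.log.append(json.dumps(change))

    def apply_pending(self, target, max_retries: int = 3):
        # Apply logged changes to the target store, retrying on failure.
        while self.applied < len(self.log):
            entry = json.loads(self.log[self.applied])
            for _attempt in range(max_retries):
                try:
                    target.apply(entry)
                    break
                except IOError:
                    continue  # transient failure: retry from the log
            else:
                return  # give up for now; the entry stays queued
            self.applied += 1

class FlakyStore:
    """Target database that fails on its first write attempt."""
    def __init__(self):
        self.rows, self._failures = {}, 1
    def apply(self, entry):
        if self._failures:
            self._failures -= 1
            raise IOError("transient write failure")
        self.rows[entry["key"]] = entry["value"]

wal = WriteAheadLog()
wal.write({"key": "user:1", "value": "active"})
store = FlakyStore()
wal.apply_pending(store)
print(store.rows)  # {'user:1': 'active'}
```

The retry policy lives in the WAL layer, not in application code — which is the property that lets teams reconfigure retries, delays, and targets without code changes.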

  3. Article
    Supabase · 18w

    Introducing iceberg-js: A JavaScript Client for Apache Iceberg

    Supabase released iceberg-js, an open-source JavaScript/TypeScript client for the Apache Iceberg REST Catalog API. The library provides type-safe catalog management for namespaces and tables, works across all JavaScript environments, and is intentionally minimal—it handles only catalog operations, not data reads/writes or query execution. Built to power Supabase's Analytics Buckets feature, it is vendor-agnostic, uses the native fetch API, and supports multiple authentication methods. The MIT-licensed library is available on GitHub and npm.

  4. Article
    InfoQ · 17w

    Decathlon Switches to Polars to Optimize Data Pipelines and Infrastructure Costs

    Decathlon migrated data pipelines processing small to mid-size datasets (under 50 GiB) from Apache Spark clusters to Polars running on single Kubernetes pods. The switch reduced compute launch time from 8 to 2 minutes and significantly lowered infrastructure costs. Polars' streaming engine enables processing datasets larger than available memory on modest hardware. The team now uses Polars for new pipelines with stable, smaller input tables that don't require complex joins or aggregations, while keeping Spark for terabyte-scale workloads. Challenges include managing Kubernetes infrastructure and limitations with certain Delta Lake features.
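The streaming idea mentioned above — process the input in small batches so the full dataset never sits in memory — can be sketched in plain Python. This is a conceptual illustration of the technique, not Polars' actual engine; the function and data names are made up for the example.

```python
from itertools import islice

def stream_group_sum(rows, key, value, chunk_size=2):
    """Chunked group-by aggregation in the spirit of a streaming
    engine: only one small batch is materialized at a time, so the
    input may be far larger than available memory."""
    totals = {}
    it = iter(rows)
    # Pull a bounded chunk, fold it into running state, repeat.
    while chunk := list(islice(it, chunk_size)):
        for row in chunk:
            totals[row[key]] = totals.get(row[key], 0) + row[value]
    return totals

# In practice `rows` would be a lazy reader over files on disk;
# a small in-memory list keeps the sketch self-contained.
sales = [
    {"store": "lille", "eur": 10},
    {"store": "lyon",  "eur": 7},
    {"store": "lille", "eur": 3},
]
print(stream_group_sum(sales, "store", "eur"))  # {'lille': 13, 'lyon': 7}
```

Because only the running totals and one chunk live in memory, peak usage is bounded by the chunk size plus the number of distinct groups — which is why such pipelines fit on a single modest pod.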