Best of Data Engineering: December 2025

  1. Article
    MotherDuck · 16w

    Stop Paying the Complexity Tax

    Most organizations don't need massive distributed data systems. The industry has over-engineered solutions for edge cases, forcing everyone to pay a complexity tax for scale they'll never require. Modern single-machine databases can handle what previously required distributed systems, with machines now offering 192 cores and 1.5TB of memory. By separating storage (cheap, infinite object storage) from compute (ephemeral, cloneable instances), and designing for the common case of small data with occasional big compute needs, teams can achieve better performance with dramatically simpler architecture. DuckDB exemplifies this approach by focusing on the complete user experience, not just query performance, while MotherDuck extends it with cloud durability and per-user isolation through individual database instances that spin up in under 100ms.
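The storage/compute split described above can be sketched in a few lines. The sketch below uses Python's built-in sqlite3 as a stand-in for an in-process engine such as DuckDB (an assumption for illustration, not MotherDuck's actual implementation): data lives once in shared durable storage, while each query gets its own short-lived, read-only compute instance.

```python
import os
import sqlite3
import tempfile

# Shared, durable storage: one database file standing in for cheap
# object storage. sqlite3 is used only as a stand-in for an
# in-process engine such as DuckDB.
storage = os.path.join(tempfile.mkdtemp(), "shared.db")
with sqlite3.connect(storage) as conn:
    conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO events VALUES (?, ?)",
                     [(1, 9.5), (1, 2.5), (2, 4.0)])
    conn.commit()

def ephemeral_compute(user_id):
    """Spin up a short-lived, per-user compute instance: open a
    read-only connection to shared storage, run the query, tear it
    all down. Storage persists; compute does not."""
    conn = sqlite3.connect(f"file:{storage}?mode=ro", uri=True)
    try:
        (total,) = conn.execute(
            "SELECT SUM(amount) FROM events WHERE user_id = ?", (user_id,)
        ).fetchone()
        return total
    finally:
        conn.close()  # the ephemeral instance is discarded

print(ephemeral_compute(1))  # 12.0
```

Because each "compute instance" holds no state of its own, instances can be created per user, per query, and discarded immediately, which is the isolation model the summary describes.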

  2. Article
    ByteByteGo · 19w

    How Netflix Built a Distributed Write Ahead Log For Its Data Platform

    Netflix built a distributed Write-Ahead Log (WAL) system to solve data reliability issues across their platform. The WAL captures every data change before applying it to databases, enabling automatic retries, cross-region replication, and multi-partition consistency. Built on top of their Data Gateway Infrastructure, it uses Kafka and Amazon SQS as pluggable backends, supports multiple use cases through namespaces, and scales independently through sharded deployments. The system provides durability guarantees while allowing teams to configure retry logic, delays, and targets without code changes.
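The core WAL pattern above — record every change durably before applying it, then retry failed applies from the log — can be illustrated with a toy sketch. This is a conceptual illustration only; Netflix's system layers Kafka/SQS backends, namespaces, and sharded deployments on top of this idea, and all names here are hypothetical.

```python
import json

class WriteAheadLog:
    """Toy WAL: mutations are recorded *before* they are applied,
    so failed applies can be retried from the log."""

    def __init__(self):
        self.log = []     # stand-in for a durable, ordered log (e.g. a Kafka topic)
        self.applied = 0  # index of the next unapplied entry

    def write(self, change: dict):
        # Durability before visibility: log the change first.
        self.log.append(json.dumps(change))

    def apply_pending(self, target, max_retries: int = 3):
        # Apply logged changes to the target store, retrying on failure.
        while self.applied < len(self.log):
            entry = json.loads(self.log[self.applied])
            for _attempt in range(max_retries):
                try:
                    target.apply(entry)
                    break
                except IOError:
                    continue  # transient failure: retry from the log
            else:
                return  # give up for now; the entry stays queued
            self.applied += 1

class FlakyStore:
    """Target database that fails on its first write attempt."""
    def __init__(self):
        self.rows, self._failures = {}, 1
    def apply(self, entry):
        if self._failures:
            self._failures -= 1
            raise IOError("transient write failure")
        self.rows[entry["key"]] = entry["value"]

wal = WriteAheadLog()
wal.write({"key": "user:1", "value": "active"})
store = FlakyStore()
wal.apply_pending(store)
print(store.rows)  # {'user:1': 'active'}
```

The retry policy lives in the WAL layer, not in application code — which is the property that lets teams reconfigure retries, delays, and targets without code changes.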

  3. Article
    Supabase · 18w

    Introducing iceberg-js: A JavaScript Client for Apache Iceberg

    Supabase released iceberg-js, an open-source JavaScript/TypeScript client for the Apache Iceberg REST Catalog API. The library provides type-safe catalog management for namespaces and tables, works across all JavaScript environments, and is intentionally minimal—it handles only catalog operations, not data reads/writes or query execution. Built to power Supabase's Analytics Buckets feature, it is vendor-agnostic, uses the native fetch API, and supports multiple authentication methods. The MIT-licensed library is available on GitHub and npm.

  4. Article
    InfoQ · 17w

    Decathlon Switches to Polars to Optimize Data Pipelines and Infrastructure Costs

    Decathlon migrated data pipelines processing small to mid-size datasets (under 50 GiB) from Apache Spark clusters to Polars running on single Kubernetes pods. The switch reduced compute launch time from 8 to 2 minutes and significantly lowered infrastructure costs. Polars' streaming engine enables processing datasets larger than available memory on modest hardware. The team now uses Polars for new pipelines with stable, smaller input tables that don't require complex joins or aggregations, while keeping Spark for terabyte-scale workloads. Challenges include managing Kubernetes infrastructure and limitations with certain Delta Lake features.
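The streaming idea mentioned above — process the input in small batches so the full dataset never sits in memory — can be sketched in plain Python. This is a conceptual illustration of the technique, not Polars' actual engine; the function and data names are made up for the example.

```python
from itertools import islice

def stream_group_sum(rows, key, value, chunk_size=2):
    """Chunked group-by aggregation in the spirit of a streaming
    engine: only one small batch is materialized at a time, so the
    input may be far larger than available memory."""
    totals = {}
    it = iter(rows)
    # Pull a bounded chunk, fold it into running state, repeat.
    while chunk := list(islice(it, chunk_size)):
        for row in chunk:
            totals[row[key]] = totals.get(row[key], 0) + row[value]
    return totals

# In practice `rows` would be a lazy reader over files on disk;
# a small in-memory list keeps the sketch self-contained.
sales = [
    {"store": "lille", "eur": 10},
    {"store": "lyon",  "eur": 7},
    {"store": "lille", "eur": 3},
]
print(stream_group_sum(sales, "store", "eur"))  # {'lille': 13, 'lyon': 7}
```

Because only the running totals and one chunk live in memory, peak usage is bounded by the chunk size plus the number of distinct groups — which is why such pipelines fit on a single modest pod.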