Best of Data Engineering — October 2025

1
Article
ByteByteGo·30w
How Nubank Built an In-house Logging Platform for 1 Trillion Log Entries
Nubank built an in-house logging platform to replace a costly third-party vendor, handling 1 trillion daily log entries at 50% lower cost. The solution uses a two-phase architecture: an ingestion pipeline with Fluent Bit, custom buffering, and processing services, plus a query/storage layer combining Trino, AWS S3, and the Parquet format. The platform processes 1 petabyte daily, maintains 45 petabytes of searchable data with 45-day retention, and serves 15,000 queries per day that together scan 150 petabytes. Key design decisions included decoupling ingestion from querying, implementing micro-batching for reliability, and achieving 95% data compression with Parquet.
2
Article
ByteByteGo·31w
How Pinterest Transfers Hundreds of Terabytes of Data With CDC
Pinterest built a unified Change Data Capture platform to handle thousands of database shards and millions of queries per second. The system uses Debezium and Apache Kafka with a two-layer architecture: a control plane that manages connector configurations and a data plane that streams database changes. Key challenges included out-of-memory errors from large backlogs, frequent task rebalancing causing instability, slow failover recovery taking over two hours, and duplicate tasks from a Kafka bug. Solutions involved bootstrapping from latest offsets, increasing rebalance timeouts to 10 minutes, enabling worker-level shard discovery, and upgrading Kafka from version 2.8.2 to 3.6, which reduced CPU usage from 99% to 45% and stabilized the system to run 3,000 tasks reliably.
3
Article
Debezium·34w
Debezium 3.3.0.Final Released
Debezium 3.3.0.Final introduces major enhancements including a new Quarkus extension for PostgreSQL integration, a CockroachDB connector, Apache Kafka 4.1 support, and exactly-once semantics for all core connectors. The release includes OpenLineage support for MongoDB and JDBC sink connectors, performance optimizations across the Oracle, PostgreSQL, and MySQL connectors, and enhanced Debezium Platform features such as a smart editor and connection management. Breaking changes include the removal of deprecated snapshot modes and updates to JDBC sink data type precision handling.
4
Article
Facebook Engineering·33w
Introducing OpenZL: An Open Source Format-Aware Compression Framework
Meta released OpenZL, an open source lossless compression framework for structured data that achieves format-specific compression performance while maintaining a single universal decompressor. The framework applies configurable transformation sequences based on data structure descriptions, uses an offline trainer to optimize compression plans, and supports runtime adaptation without requiring decoder updates. OpenZL demonstrates significant improvements over general-purpose compressors like Zstandard and XZ on structured datasets, offering better compression ratios while maintaining or improving speed. The system includes a Simple Data Description Language (SDDL) for defining data shapes and integrates with Meta's Managed Compression infrastructure for automated retraining as data evolves.