Best of Data Engineering: November 2025

  1. Article
    Groww Engineering · 22w

    When Two Databases Become One: How DuckDB Saved Our Trading Operations from Manual Reconciliation

    A trading platform faced recurring position-order mismatches across two separate MySQL databases, requiring 20-30 minutes of manual reconciliation by two engineers. By leveraging DuckDB's MySQL scanner extension to perform cross-database joins, they automated the entire process into a 2-3 minute operation running every 15 minutes. The solution eliminated manual intervention, improved accuracy from 85% to 99.9%, and enabled proactive monitoring instead of reactive fixes during market hours.
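The core trick is attaching two separate databases to one query session and reconciling them with a single join. The sketch below illustrates that pattern with Python's stdlib `sqlite3` as a runnable stand-in for DuckDB's MySQL scanner (in production, DuckDB's `mysql` extension would `ATTACH` the two MySQL servers the same way); all table and column names here are hypothetical.

```python
import os
import sqlite3
import tempfile

# Two separate database files stand in for the platform's two MySQL databases.
tmp = tempfile.mkdtemp()
positions_db = os.path.join(tmp, "positions.db")
orders_db = os.path.join(tmp, "orders.db")

with sqlite3.connect(positions_db) as con:
    con.execute("CREATE TABLE positions (account_id TEXT, symbol TEXT, quantity INTEGER)")
    con.executemany("INSERT INTO positions VALUES (?, ?, ?)",
                    [("A1", "AAPL", 100), ("A1", "TSLA", 50)])

with sqlite3.connect(orders_db) as con:
    con.execute("CREATE TABLE orders (account_id TEXT, symbol TEXT, filled_qty INTEGER)")
    con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                    [("A1", "AAPL", 60), ("A1", "AAPL", 40), ("A1", "TSLA", 40)])

# Cross-database join: attach both files to one session (as DuckDB attaches
# the two MySQL servers), then reconcile positions against filled orders.
con = sqlite3.connect(":memory:")
con.execute(f"ATTACH DATABASE '{positions_db}' AS pos")
con.execute(f"ATTACH DATABASE '{orders_db}' AS ords")
mismatches = con.execute("""
    SELECT p.account_id, p.symbol, p.quantity,
           COALESCE(SUM(o.filled_qty), 0) AS filled
    FROM pos.positions p
    LEFT JOIN ords.orders o
      ON o.account_id = p.account_id AND o.symbol = p.symbol
    GROUP BY p.account_id, p.symbol, p.quantity
    HAVING p.quantity != COALESCE(SUM(o.filled_qty), 0)
""").fetchall()
print(mismatches)  # [('A1', 'TSLA', 50, 40)]
```

Because the mismatch report is one query, it is easy to schedule every 15 minutes and alert only when the result set is non-empty.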

  2. Article
    ByteByteGo · 22w

    How Spotify Built Its Data Platform To Understand 1.4 Trillion Data Points

    Spotify processes 1.4 trillion data points daily through a sophisticated data platform that evolved from a single Hadoop cluster to a multi-product system running on Google Cloud. The platform consists of three core components: data collection (capturing events from millions of devices using client SDKs and Kubernetes operators), data processing (running 38,000+ automated pipelines using BigQuery, Flink, and Apache Beam), and data management (ensuring privacy, security, and compliance). The architecture emphasizes self-service capabilities, allowing product teams to define event schemas and deploy infrastructure automatically while maintaining centralized governance. Built-in anonymization, lineage tracking, and quality checks ensure data trustworthiness across financial reporting, personalized recommendations, and experimentation systems.
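The self-service collection pattern described above can be sketched as a central schema registry that product teams write to, with a client SDK validating events before they enter the pipeline. This is a minimal illustrative sketch, not Spotify's actual APIs; the names `register_schema` and `publish` are assumptions.

```python
import json

# Central registry: product teams declare event schemas (field -> type)
# once, and every client SDK validates against the same definition.
SCHEMA_REGISTRY = {}

def register_schema(name, fields):
    """A product team self-serves a new event type."""
    SCHEMA_REGISTRY[name] = fields

def publish(name, payload):
    """Client-side validation before an event is emitted to the pipeline."""
    schema = SCHEMA_REGISTRY[name]
    for fname, ftype in schema.items():
        if not isinstance(payload.get(fname), ftype):
            raise ValueError(f"{name}.{fname} must be {ftype.__name__}")
    return json.dumps({"event": name, "payload": payload})

# A hypothetical playback event type:
register_schema("track_played", {"user_id": str, "track_id": str, "ms_played": int})
msg = publish("track_played", {"user_id": "u1", "track_id": "t9", "ms_played": 30000})
```

Validating at the edge like this is what lets 38,000+ downstream pipelines trust event shapes without each one re-checking them.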

  3. Article
    Databricks · 21w

    The New Way to Build Pipelines on Databricks: Introducing the IDE for Data Engineering

    Databricks launched the IDE for Data Engineering in Public Preview, a dedicated development environment for building declarative data pipelines within the Databricks Workspace. The IDE provides an integrated experience with features like automatic dependency graph visualization, file-based dataset organization, built-in data previews, debugging tools, and Git integration. It supports the declarative programming paradigm where developers specify what they want to achieve rather than how to build it, with the editor handling execution planning and optimization. The tool consolidates pipeline authoring, testing, version control, and scheduling into a single interface, aiming to improve developer productivity and reduce context switching.
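The declarative paradigm can be sketched as follows: the author declares datasets and their inputs, and the framework, not the author, resolves the dependency graph and execution order. This toy sketch uses a hypothetical `table` decorator to convey the idea; it is not Databricks' actual pipeline API.

```python
# Registry of declared datasets: name -> (builder function, upstream deps).
_TABLES = {}

def table(*deps):
    """Declare a dataset and the datasets it depends on."""
    def wrap(fn):
        _TABLES[fn.__name__] = (fn, deps)
        return fn
    return wrap

def materialize(name, cache=None):
    """The framework's job: resolve dependencies, then build the dataset."""
    cache = {} if cache is None else cache
    if name not in cache:
        fn, deps = _TABLES[name]
        cache[name] = fn(*(materialize(d, cache) for d in deps))
    return cache[name]

@table()
def raw_orders():
    return [{"id": 1, "amount": 10}, {"id": 2, "amount": -5}]

@table("raw_orders")
def clean_orders(raw):
    return [r for r in raw if r["amount"] > 0]

@table("clean_orders")
def revenue(clean):
    return sum(r["amount"] for r in clean)

print(materialize("revenue"))  # 10
```

The IDE's dependency-graph visualization and data previews fall naturally out of this model: since the graph is declared rather than implied by imperative code, the tooling can render and inspect it directly.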

  4. Article
    System Design Codex · 23w

    Key Concepts of Kafka

    Kafka is a distributed event store and streaming platform that has become essential for large-scale data pipelines at companies like Netflix and Uber. The core architecture consists of messages organized into topics and partitions, with producers writing data and consumers reading it in groups. Brokers form clusters that handle message storage and replication for reliability. Key advantages include support for multiple producers and consumers, disk-based retention for durability, and horizontal scalability. However, challenges include complex configuration options, inconsistent tooling, limited client library maturity outside Java/C, and lack of true multi-tenancy.
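The core data model above (topics split into append-only partitions, key-based routing, consumers tracking their own offsets) can be sketched in a few lines. This is an in-memory toy to illustrate the concepts, with no brokers, replication, or consumer-group rebalancing.

```python
import hashlib
from collections import defaultdict

class Topic:
    """A topic is a set of append-only partition logs."""
    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # Same key -> same partition, like Kafka's default partitioner,
        # which is what preserves per-key ordering.
        p = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

class Consumer:
    """Tracks one offset per partition, like a consumer-group member."""
    def __init__(self, topic):
        self.topic = topic
        self.offsets = defaultdict(int)

    def poll(self, partition):
        log = self.topic.partitions[partition]
        start = self.offsets[partition]
        self.offsets[partition] = len(log)  # "commit" up to the log end
        return log[start:]

orders = Topic("orders", num_partitions=3)
for i in range(5):
    orders.produce("user-42", f"event-{i}")

consumer = Consumer(orders)
p, _ = orders.produce("user-42", "event-5")
print(consumer.poll(p))  # all six events for user-42, in order
```

Note that ordering is guaranteed only within a partition; the disk-based retention and replication the article credits to brokers are exactly the parts this sketch omits.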