Best of Data Engineering · September 2025

  1. Article
    System Design Newsletter · 29w

    How Kafka Works

    Apache Kafka is a distributed, fault-tolerant pub/sub messaging system built on a simple log data structure. It uses brokers for horizontal scaling, partitions for data sharding, and replication for durability. The system employs KRaft consensus for leader election and metadata management. Key features include tiered storage for cost optimization, consumer groups for parallel processing, transactions for exactly-once semantics, and ecosystem components like Kafka Streams for stream processing and Kafka Connect for system integration.
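The core ideas named above — an append-only log per partition, key-based sharding, and consumer groups splitting partitions for parallelism — can be sketched in a few lines. This is a toy model for intuition only, not the Kafka client API; the class and method names are illustrative.

```python
import hashlib

class MiniLog:
    """Toy model of Kafka's core abstraction: a topic split into
    append-only, ordered partition logs. Illustrative only."""

    def __init__(self, num_partitions=3):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # Kafka's default partitioner hashes the record key, so records
        # with the same key always land in (and stay ordered within)
        # the same partition.
        idx = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(self.partitions)
        self.partitions[idx].append(value)
        return idx

    def assign(self, consumers):
        # Consumer-group assignment: each group member owns a disjoint
        # subset of partitions, which is what enables parallel consumption.
        return {c: [p for p in range(len(self.partitions)) if p % len(consumers) == i]
                for i, c in enumerate(consumers)}

log = MiniLog()
p1 = log.produce("user-42", "click")
p2 = log.produce("user-42", "purchase")
assert p1 == p2  # same key -> same partition -> per-key ordering preserved
```

Replication and KRaft leader election then layer on top of this: each partition log is copied to several brokers, one of which is elected leader for writes.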

  2. Article
    Decube · 28w

    Lessons Learned in Data Engineering 2025: Do’s, Don’ts & Best Practices

    A comprehensive guide sharing 15 years of data engineering experience, covering essential practices for 2025. Key recommendations include implementing data lineage from day one, establishing data contracts, investing in observability over monitoring, treating metadata as critical infrastructure, and building for change rather than stability. The guide emphasizes that modern data engineering is about creating trust in data rather than just moving it, especially as organizations become AI-ready and navigate multi-cloud environments.
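The "data contract" recommendation boils down to producer and consumer agreeing on field names, types, and nullability, enforced at the pipeline boundary. A minimal sketch of that idea, with an entirely hypothetical contract and field names:

```python
# Hypothetical data contract: expected type and nullability per field,
# checked on every record before a batch is accepted downstream.
CONTRACT = {
    "order_id": (str, False),   # (expected type, nullable?)
    "amount":   (float, False),
    "coupon":   (str, True),
}

def violations(record):
    """Return a list of contract violations for one record (empty = valid)."""
    errs = []
    for field, (typ, nullable) in CONTRACT.items():
        if field not in record:
            errs.append(f"missing field: {field}")
        elif record[field] is None:
            if not nullable:
                errs.append(f"null in non-nullable field: {field}")
        elif not isinstance(record[field], typ):
            errs.append(f"wrong type for {field}: {type(record[field]).__name__}")
    return errs

assert violations({"order_id": "o1", "amount": 9.99, "coupon": None}) == []
assert violations({"order_id": "o1", "amount": "9.99", "coupon": None}) \
    == ["wrong type for amount: str"]
```

Rejecting (or quarantining) violating records at the boundary is what turns "moving data" into the trust-building the article argues for.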

  3. Article
    Netflix TechBlog · 29w

    Scaling Muse: How Netflix Powers Data-Driven Creative Insights at Trillion-Row Scale

    Netflix evolved their Muse analytics platform to handle trillion-row datasets by implementing HyperLogLog sketches for approximate distinct counts, using Hollow for in-memory precomputed aggregates, and extensively tuning their Apache Druid cluster. The migration reduced query latencies by 50% while supporting advanced filtering capabilities for creative content insights. The team used parallel stack deployment, automated validation, and granular feature flags to ensure data accuracy during the transition.
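HyperLogLog is the key trick here: it estimates distinct counts over huge datasets using a few kilobytes of registers instead of storing every value. A minimal, self-contained sketch of the algorithm (not Netflix's or Druid's implementation):

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog: approximate distinct counts in m = 2^p
    registers. Error is roughly 1.04 / sqrt(m)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = h & (self.m - 1)        # low p bits pick a register
        w = h >> self.p               # remaining 64 - p bits
        # Rank = position of the first 1-bit in w. Long runs of leading
        # zeros are exponentially rare, so the max rank per register
        # encodes how many distinct values were seen.
        rank = (64 - self.p) - w.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:   # small-range (linear counting) fix
            est = self.m * math.log(self.m / zeros)
        return int(est)

hll = HyperLogLog()
for i in range(10_000):
    hll.add(i)
# Estimate should land within a few percent of the true 10,000.
```

Because sketches like this merge with a simple per-register max, they can be precomputed per segment and combined at query time — which is what makes trillion-row distinct counts tractable.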

  4. Article
    Google Open Source Blog · 29w

    Apache Iceberg 1.10: Maturing the V3 spec, the REST API and Google contributions

    Apache Iceberg 1.10.0 introduces major improvements including full Spark 4.0 and Flink 2.0 compatibility, production-ready Deletion Vectors for faster row-level updates, and a hardened REST Catalog API. The release matures the V3 specification with features like row lineage and variant types. Google contributed native BigQuery Metastore Catalog support and Google AuthManager, enabling seamless integration with BigLake-managed tables through open REST protocols.
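A deletion vector is conceptually a bitmap of deleted row positions stored alongside an immutable data file, so a row-level delete never rewrites the file; readers apply the bitmap at scan time. A toy model of the idea (Iceberg V3 actually stores compressed roaring bitmaps in Puffin files, not a plain integer as here):

```python
class DeletionVector:
    """Toy deletion vector: a bitmap of deleted row positions kept
    next to an immutable data file. Conceptual sketch only."""

    def __init__(self):
        self.bits = 0

    def delete(self, pos):
        self.bits |= 1 << pos          # mark row `pos` deleted; data file untouched

    def is_live(self, pos):
        return not (self.bits >> pos) & 1

rows = ["a", "b", "c", "d"]            # contents of an immutable data file
dv = DeletionVector()
dv.delete(1)
dv.delete(3)
live = [r for i, r in enumerate(rows) if dv.is_live(i)]
assert live == ["a", "c"]              # readers merge the vector at scan time
```

This is why the release notes call deletion vectors "faster row-level updates": a delete becomes a tiny bitmap write instead of a copy-on-write of the whole data file.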

  5. Article
    Towards Data Science · 30w

    Building Research Agents for Tech Insights

    A comprehensive guide to building specialized AI research agents that can aggregate and analyze tech content from multiple sources. The approach uses structured workflows, data caching, and prompt chaining to create personalized tech reports. Key components include preprocessing data pipelines, strategic use of small vs large language models for cost optimization, and structured JSON outputs for reliability. The system fetches trending keywords, processes facts from tech forums, and generates themed reports based on user profiles.
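The chained workflow described — a cheap small-model pass to filter items, one expensive large-model call to draft the report, and validation of structured JSON output — can be sketched as below. The `llm_small` and `llm_large` functions are stand-ins for real model API calls, and all names are hypothetical:

```python
import json

def llm_small(prompt):
    # Stand-in for a cheap small-model call: classify one item's relevance.
    item = json.loads(prompt)
    return json.dumps({"relevant": "kafka" in item["text"].lower()})

def llm_large(prompt):
    # Stand-in for an expensive large-model call: draft a themed report.
    items = json.loads(prompt)
    return json.dumps({"theme": "streaming", "items": [i["title"] for i in items]})

def run_pipeline(raw_items):
    # Step 1: per-item relevance filter via the small model (cost optimization).
    kept = [i for i in raw_items
            if json.loads(llm_small(json.dumps(i)))["relevant"]]
    # Step 2: a single large-model call generates the report.
    report = json.loads(llm_large(json.dumps(kept)))
    # Step 3: validate the structured JSON before trusting it (reliability).
    assert set(report) == {"theme", "items"}, "malformed model output"
    return report

report = run_pipeline([
    {"title": "Kafka tiered storage", "text": "Kafka adds tiered storage"},
    {"title": "CSS tricks", "text": "styling buttons"},
])
assert report["items"] == ["Kafka tiered storage"]
```

The split mirrors the article's cost argument: many tiny classification calls go to the small model, while the single synthesis step — where quality matters most — gets the large one.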