Best of Big Data — 2025

1
Article
DEV·1y
How Programming Will Look In the Future?
Programming has largely stuck to the von Neumann paradigm since the 1940s, but that sequential model is a poor fit for modern multi-core hardware. Traditional concurrency tools such as Go's goroutines add complexity of their own. Data flow programming offers an alternative: it treats a program as a network of independent nodes that pass data to one another, which avoids race conditions and allows natural parallelism. Nevalang is a new language built around this paradigm and is presented as a promising direction for programming. However, it is still in development and looking for contributors.
1K
100
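The data flow idea in the summary above can be illustrated with a minimal Python sketch. This is not Nevalang code and not taken from the article: the names `node` and `run_pipeline` are hypothetical, and threads connected by queues stand in for independent nodes that only communicate by passing data.

```python
import queue
import threading

# Sentinel object that tells a node its input stream has ended.
_DONE = object()


def node(fn, inbox, outbox):
    """Start a thread that applies fn to each item from inbox and forwards it."""
    def run():
        while True:
            item = inbox.get()
            if item is _DONE:
                outbox.put(_DONE)  # propagate end-of-stream downstream
                return
            outbox.put(fn(item))

    thread = threading.Thread(target=run)
    thread.start()
    return thread


def run_pipeline(values, *fns):
    """Wire one node per function into a linear network and collect the output."""
    # One queue in front of each node, plus one for the final output.
    queues = [queue.Queue() for _ in range(len(fns) + 1)]
    threads = [node(fn, queues[i], queues[i + 1]) for i, fn in enumerate(fns)]

    for value in values:
        queues[0].put(value)
    queues[0].put(_DONE)

    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)

    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3], lambda x: x + 1, lambda x: x * 10))
```

The nodes share no mutable state; each one only reads from its inbox and writes to its outbox, which is why no locks appear in the node logic itself.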
2
Article
BigData Boutique blog·1y
Elasticsearch vs OpenSearch - 2025 update
An in-depth 2025 update comparing Elasticsearch and OpenSearch, touching on project status, performance, licensing, vector search capabilities, cost efficiency, and ecosystem solutions. OpenSearch has gained traction with open-source governance and additional vector search engines, while Elasticsearch maintains proprietary features and extensive integration solutions.
178
1
3
Article
Data Engineer Things·1y
End to End Data Engineering
This post details the tools, technologies, and concepts essential for data engineering, emphasizing different paths for success based on roles and backgrounds. It highlights the importance of both analytics and infrastructure sides and mentions popular tools like Airflow and Snowflake. The significance of software engineering principles and specific data engineering roles is also discussed.
161
1
4
Article
Hacker News·46w
kepler.gl
Kepler.gl is a WebGL-powered geospatial data visualization tool designed for analyzing and visualizing large-scale datasets in web browsers. Built with high-performance rendering capabilities, it enables interactive exploration of geographic data. Foursquare Studio extends kepler.gl's framework as a free analytics platform with regular feature updates.
154
2
5
Video
Coding with Lewis·1y
How Notion Handles 200 BILLION Notes (Without Crashing)
Notion has managed its rapid growth by adopting sharding to distribute its data across many smaller databases. Initially using a single Postgres database, they experienced slowdowns and shifted to sharding their block model. They later built their own data lake using AWS S3, Apache Spark, and other open-source tools to handle their data processing needs effectively. By reorganizing and scaling up their infrastructure, Notion maintained performance and avoided service interruptions for users.
122
6
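As a rough illustration of the sharding approach in the Notion summary above, here is a minimal Python sketch of hash-based shard routing. The shard count, the choice of a workspace ID as the shard key, and the names `shard_for` and `ShardedBlockStore` are all assumptions made for this example; the summary does not say how Notion picks a shard.

```python
import hashlib

NUM_SHARDS = 8  # hypothetical shard count, chosen only for the example


def shard_for(workspace_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a shard key to a shard index with a stable hash."""
    # hashlib is used instead of the built-in hash(), which is randomized
    # per process for strings and so is not stable across restarts.
    digest = hashlib.sha256(workspace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedBlockStore:
    """In-memory stand-in for many small databases, one dict per shard."""

    def __init__(self, num_shards: int = NUM_SHARDS):
        self.shards = [dict() for _ in range(num_shards)]

    def _shard(self, workspace_id: str) -> dict:
        return self.shards[shard_for(workspace_id, len(self.shards))]

    def put(self, workspace_id: str, block_id: str, content: str) -> None:
        self._shard(workspace_id)[(workspace_id, block_id)] = content

    def get(self, workspace_id: str, block_id: str):
        return self._shard(workspace_id).get((workspace_id, block_id))


if __name__ == "__main__":
    store = ShardedBlockStore()
    store.put("workspace-a", "block-1", "hello")
    print(store.get("workspace-a", "block-1"))
```

Because every block is routed by its workspace key, all blocks of one workspace land on the same shard, so a read for that workspace touches a single small database.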
6
Article
ByteByteGo·1y
How Netflix Stores 140 Million Hours of Viewing Data Per Day
Netflix handles millions of hours of viewing data daily by using Apache Cassandra for flexible, scalable data storage. The system has evolved to manage the increasing volume and complexity of data, implementing strategies such as horizontal partitioning, compressed storage for older data, and efficient data retrieval methods. To further optimize performance and reduce costs, Netflix redesigned its architecture to categorize data by type and age, improving both storage efficiency and retrieval speeds.
120
7
Article
Salesforce Engineering·30w
Architecting Multi-System Production Platform
Salesforce built Digital Wallet, a consumption-based pricing platform serving 15,000+ organizations and generating $400M+ in annual contract value. The engineering team overcame significant challenges as Data Cloud's first customer, including implementing SOX-compliant metadata security through Strict System Mode, building a custom event subscriber processing 20M daily events, and architecting failover strategies for near real-time usage tracking. The platform integrates multiple systems using fan-out mechanisms for entitlement sync, implements Spark job failover between EMR-on-EKS and EMR-on-EC2 to avoid rate limits, and maintains billing accuracy through architectural separation of hourly customer-facing updates from monthly financial reconciliation. The system includes high-cardinality monitoring, automatic retry logic, and a month-long buffer for usage reconciliation before billing.
106
8
Article
Sysco LABS Sri Lanka·1y
Event-Driven Architecture: How Enterprises Manage Billions of Events
Event-Driven Architecture (EDA) is a software design pattern gaining popularity for managing Big Data, microservices, and real-time processing. EDA decouples services, enhancing scalability, resilience, and efficiency. It facilitates asynchronous communication through events, enabling systems to handle real-time data effectively. The post covers the benefits of EDA, its key components, real-world applications in companies like Sysco and Uber, and compares EDA with service mesh architecture. It also highlights the scalability, flexibility, and potential challenges of implementing EDA.
101
2
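The decoupling described in the Event-Driven Architecture summary above can be sketched with a small in-process event bus in Python. This is only an illustration: `EventBus` and its methods are hypothetical names, and a real deployment would typically use a message broker rather than an in-memory queue.

```python
from collections import defaultdict, deque


class EventBus:
    """Minimal publish/subscribe bus with deferred (queued) delivery."""

    def __init__(self):
        self._handlers = defaultdict(list)  # event type -> subscriber callbacks
        self._queue = deque()               # events waiting to be delivered

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        # The producer only enqueues the event. It does not know who
        # consumes it, and it does not wait for consumers to finish.
        self._queue.append((event_type, payload))

    def drain(self):
        """Deliver all queued events and return how many were processed."""
        delivered = 0
        while self._queue:
            event_type, payload = self._queue.popleft()
            for handler in self._handlers[event_type]:
                handler(payload)
            delivered += 1
        return delivered


if __name__ == "__main__":
    bus = EventBus()
    bus.subscribe("order_placed", lambda order: print("billing saw", order))
    bus.subscribe("order_placed", lambda order: print("shipping saw", order))
    bus.publish("order_placed", {"id": 1})
    bus.drain()
```

Adding a new consumer is a single `subscribe` call; the publishing service does not change, which is the decoupling the summary refers to.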
9
Article
Confluent Blog·1y
The Future of AI Agents is Event-Driven
AI agents are poised to transform enterprise operations by adopting event-driven architecture. This architectural approach addresses interoperability challenges and enhances scalability. EDA allows agents to operate independently, integrate seamlessly, and adapt workflows dynamically, overcoming the limitations of fixed workflows and tightly coupled systems. It ensures agents can effectively handle complex, interconnected tasks, thereby unlocking their full potential. The article highlights the importance of EDA in creating resilient, scalable AI systems and warns against the risks of outdated architecture in the evolving AI landscape.
89
1
10
Article
Quastor Daily·1y
The Architecture of Grab's Data Lake
Grab, a leading tech company in Southeast Asia, uses a data lake to manage its vast data, generated from services like ride-sharing and food delivery. The company uses Apache Avro with a Merge on Read strategy for high-throughput data, allowing efficient writes and periodic compaction to manage read costs. For low-throughput data, Grab uses Parquet with Copy on Write to ensure fast reads and data consistency. The post also discusses various data storage formats and their trade-offs in terms of readability, compression, and schema evolution.
88
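The trade-off between the two write strategies in the Grab summary above can be shown with a toy Python model. This is a sketch under simplifying assumptions, not Grab's implementation: plain dicts stand in for base files, a list stands in for the append-only log, and both class names are hypothetical.

```python
class CopyOnWriteTable:
    """Every write rewrites the base data, so reads need no merging."""

    def __init__(self):
        self.base = {}

    def upsert(self, key, value):
        # Expensive write: build a whole new copy of the base data.
        new_base = dict(self.base)
        new_base[key] = value
        self.base = new_base

    def read(self, key):
        # Cheap read: the base data is always up to date.
        return self.base.get(key)


class MergeOnReadTable:
    """Writes append to a log; reads merge the log over the base data."""

    def __init__(self):
        self.base = {}
        self.log = []

    def upsert(self, key, value):
        # Cheap write: append only.
        self.log.append((key, value))

    def read(self, key):
        # Costlier read: replay the log on top of the base value.
        value = self.base.get(key)
        for logged_key, logged_value in self.log:
            if logged_key == key:
                value = logged_value
        return value

    def compact(self):
        # Periodic compaction folds the log into the base data,
        # which brings read cost back down.
        merged = dict(self.base)
        for key, value in self.log:
            merged[key] = value
        self.base = merged
        self.log = []


if __name__ == "__main__":
    table = MergeOnReadTable()
    table.upsert("ride-1", "requested")
    table.upsert("ride-1", "completed")
    print(table.read("ride-1"))
    table.compact()
    print(table.read("ride-1"), len(table.log))
```

Both tables return the same values; they differ in where the work happens, which is why the summary pairs Merge on Read with high-throughput writes and Copy on Write with read-heavy data.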
11
Article
DuckDB·46w
DuckLake 0.2
DuckLake 0.2 introduces significant improvements including secrets management for credentials, enhanced Parquet file settings, relative schema/table paths for better organization, name mapping for existing Parquet files, scoped settings at schema and table levels, and partition transforms. The update includes automatic migration from v0.1 and adds new functions like ducklake_list_files for better system integration.
69
12
Article
ByteByteGo·28w
How Spotify Built Its Data Platform To Understand 1.4 Trillion Data Points
Spotify processes 1.4 trillion data points daily through a sophisticated data platform that evolved from a single Hadoop cluster to a multi-product system running on Google Cloud. The platform consists of three core components: data collection (capturing events from millions of devices using client SDKs and Kubernetes operators), data processing (running 38,000+ automated pipelines using BigQuery, Flink, and Apache Beam), and data management (ensuring privacy, security, and compliance). The architecture emphasizes self-service capabilities, allowing product teams to define event schemas and deploy infrastructure automatically while maintaining centralized governance. Built-in anonymization, lineage tracking, and quality checks ensure data trustworthiness across financial reporting, personalized recommendations, and experimentation systems.
67
13
Article
Baeldung·1y
Introduction to Apache Kylin
Apache Kylin is an open-source OLAP engine designed for sub-second query performance on massive datasets. Initially developed by eBay and later managed by the Apache Software Foundation, it excels in handling high concurrency and integrates seamlessly with Hadoop and data lake platforms. Key features include multidimensional modeling, optimized indexing, and support for both batch and streaming data sources. The platform can be easily explored using Docker, allowing for straightforward setup, model creation, and CUBE building via SQL and REST API.
63
14
Article
Data Engineer Things·1y
I spent 6 hours learning AWS Glue. Here is what I found
AWS Glue is a serverless data integration service that simplifies and automates the ETL process, enabling users to integrate data from various sources, preprocess and transform it, and make it available for analytics. It seamlessly integrates with AWS services like S3, Redshift, and Athena and supports cost-effective and scalable data processing. Key components include Glue Studio, Glue ETL Library with DynamicFrames, and serverless execution with auto-scaling. The Glue Data Catalog acts as a central repository for metadata, facilitating efficient data discovery and management.
56
1
15
Article
Decube·34w
Lessons Learned in Data Engineering 2025: Do’s, Don’ts & Best Practices
A comprehensive guide sharing 15 years of data engineering experience, covering essential practices for 2025. Key recommendations include implementing data lineage from day one, establishing data contracts, investing in observability over monitoring, treating metadata as critical infrastructure, and building for change rather than stability. The guide emphasizes that modern data engineering is about creating trust in data rather than just moving it, especially as organizations become AI-ready and navigate multi-cloud environments.
47
1
16
Article
databricks·52w
Introducing Apache Spark 4.0
Apache Spark 4.0 introduces key advancements in SQL language, Python support, structured streaming, and usability, enhancing big data processing. Notable features include improved multi-language compatibility, new SQL scripting capabilities, enhanced Python APIs, and structured logging. This release offers greater modularity, scalability, and standards compliance, making it future-ready for large-scale data analytics.
43
17
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
FireDucks vs. Pandas vs. DuckDB vs. Polars
FireDucks is an optimized alternative to Pandas with the same API, requiring just an import replacement to use. It demonstrates a significant speed boost for big data operations, achieving an average speed-up of 125x over Pandas. FireDucks' lazy execution builds and optimizes a logical execution plan, unlike Pandas' immediate execution. It can be used with IPython, Jupyter Notebooks, or within existing Pandas pipelines by replacing import statements. Detailed benchmarks and usage examples are provided, showing substantial performance improvements in practical scenarios.
38
1
18
Article
Community Picks·1y
BigDataBoutique/awesome-opensearch: A curated list of links and resources all about Opensearch. Maintained by the Opensearch experts at BigData Boutique (makers of Pulse for Opensearch)
The 'awesome-opensearch' resource collection is maintained by BigData Boutique. It provides a wide range of links, tools, and articles related to OpenSearch, including official documentation, community forums, migration guides, and cost optimization tips. Contributions to the repository are encouraged, with guidelines provided for adding valuable content.
37
1
19
Article
databricks·38w
Architecting a High-Concurrency, Low-Latency Data Warehouse on Databricks That Scales
A comprehensive guide to building high-performance data warehouses on Databricks that handle hundreds of concurrent users with sub-second query response times. It covers architectural best practices including SQL Serverless Warehouses, Liquid Clustering, Unity Catalog governance, and AI-powered optimizations. It also provides a structured framework for assessment, implementation, and monitoring, with a real-world case study showing how an email marketing platform reduced costs while improving performance through materialized views and modern data organization techniques.
31
20
Article
Decube·1y
Introducing Decube's Public API
Decube has released its Public API to streamline data governance workflows. The API facilitates bulk management of glossaries, manual lineages, and user groups, enhancing efficiency and scalability. It also ensures full accountability through secure audit logging. Upcoming features include data quality scores and monitor configuration, furthering Decube's mission to empower data teams.
31
21
Article
Flink·1y
Apache Flink 2.0.0: A new Era of Real-Time Data Processing
Apache Flink 2.0.0 marks a significant release in the Flink series, introducing new features and architectural enhancements for real-time data processing. Key highlights include Disaggregated State Management, Materialized Tables, and deep integration with Apache Paimon for streaming lakehouse architectures. The release focuses on improving performance, scalability, and resource efficiency, making real-time computing more accessible and practical for diverse use cases. It also includes a new DataStream V2 API and removes several deprecated APIs, resulting in backward-incompatible changes.
30
22
Article
DuckDB·1y
Preview: Amazon S3 Tables in DuckDB
DuckDB announces a new preview feature that supports Apache Iceberg REST Catalogs, enabling easy connection to Amazon S3 Tables and Amazon SageMaker Lakehouse. It allows DuckDB users to read and query Iceberg tables directly from these platforms. The guide provides detailed steps for installing necessary extensions from the core_nightly repository and setting up S3 table buckets. The feature is currently experimental and a stable release is expected later in the year.
24
23
Article
Salesforce Engineering·47w
Architecting AI Agent Auditing Systems in Agentforce
Salesforce's Feedback and Audit Trail team built an AI auditing system for Agentforce that handles 20 million model interactions monthly across 500 enterprise customers. The system overcame significant integration challenges with Data Cloud by using Kafka-based ingestion to manage unpredictable AI traffic patterns. Key technical solutions included dynamic flow control mechanisms, Tiger Team coordination across 8-10 cross-functional teams, and iterative development approaches. The architecture prioritizes trust, security, and compliance while maintaining scalability through continuous performance monitoring and architectural improvements.
23
24
Article
Data Engineering·1y
Data Engineering Vault: 1000+ Interconnected Concepts for Data Engineers
The Data Engineering Vault is a curated collection of over 1,000 interconnected concepts designed to form a comprehensive knowledge base for data engineers. It includes detailed notes on the data engineering lifecycle, various data modeling approaches, modern data infrastructure, data transformation paradigms, analytics, and specialized techniques. The vault offers interconnected learning paths, historical context, practical applications, and recommendations for essential resources and thought leaders in the field.
23
25
Article
Data Engineer Things·1y
Why I Love Python as Data Engineer
Python is favored by data engineers for its versatility, simplicity, and rich library ecosystem. It excels in both small and large-scale data tasks, making data manipulation and automation easier. Despite some limitations like slower execution speed and memory consumption, its readable code and efficient debugging make it a preferred choice for many. Python integrates well with tools like Apache Spark and libraries for data visualization, adding to its appeal.
20
