Best of Big Data — March 2025

1
Video
Coding with Lewis·1y
How Notion Handles 200 BILLION Notes (Without Crashing)
Notion has managed its rapid growth by sharding its data across many smaller databases. After starting on a single Postgres database and running into slowdowns, the team sharded its block model. They later built their own data lake using AWS S3, Apache Spark, and other open-source tools to handle their data processing needs. By reorganizing and scaling up its infrastructure, Notion maintained performance and avoided service interruptions for users.
122
6
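The sharding idea in the Notion summary above can be illustrated with a minimal sketch of hash-based shard routing. The partition key (`workspace_id`), the shard count, and the function names are assumptions made for illustration; the summary does not state how Notion picks a shard.

```python
import hashlib

NUM_SHARDS = 32  # hypothetical shard count, chosen only for this example


def shard_for(workspace_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a partition key to a shard index using a stable hash."""
    digest = hashlib.sha256(workspace_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce modulo the shard count.
    return int.from_bytes(digest[:8], "big") % num_shards


def route(blocks):
    """Group block records by the shard their workspace maps to."""
    shards = {}
    for block in blocks:
        shards.setdefault(shard_for(block["workspace_id"]), []).append(block)
    return shards


if __name__ == "__main__":
    demo = [
        {"id": "b1", "workspace_id": "ws-a"},
        {"id": "b2", "workspace_id": "ws-a"},
        {"id": "b3", "workspace_id": "ws-b"},
    ]
    for shard, rows in route(demo).items():
        print(shard, [row["id"] for row in rows])
```

Hashing the key rather than using it directly keeps the mapping stable across processes, so every block of one workspace lands on the same shard. Changing the shard count remaps most keys, which is one reason real systems often add a layer of logical shards on top of physical databases.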
2
Article
ByteByteGo·1y
How Netflix Stores 140 Million Hours of Viewing Data Per Day
Netflix handles millions of hours of viewing data daily by using Apache Cassandra for flexible, scalable data storage. The system has evolved to manage the increasing volume and complexity of data, implementing strategies such as horizontal partitioning, compressed storage for older data, and efficient data retrieval methods. To further optimize performance and reduce costs, Netflix redesigned its architecture to categorize data by type and age, improving both storage efficiency and retrieval speeds.
120
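The Netflix summary above mentions treating data differently by age, with compressed storage for older records. A minimal sketch of that split follows; the 30-day cutoff, the record fields, and the JSON-plus-zlib encoding are all assumptions for illustration and are not taken from Netflix's actual design.

```python
import json
import zlib
from datetime import datetime, timedelta, timezone

LIVE_WINDOW = timedelta(days=30)  # hypothetical cutoff between recent and older data


def split_by_age(records, now):
    """Separate recent records from older ones that are candidates for compression."""
    live, old = [], []
    for record in records:
        target = live if now - record["watched_at"] <= LIVE_WINDOW else old
        target.append(record)
    return live, old


def compress_records(records):
    """Serialize older records into a single compressed blob."""
    rows = [{**r, "watched_at": r["watched_at"].isoformat()} for r in records]
    return zlib.compress(json.dumps(rows).encode("utf-8"))


def decompress_records(blob):
    """Restore the records stored by compress_records."""
    rows = json.loads(zlib.decompress(blob).decode("utf-8"))
    return [{**r, "watched_at": datetime.fromisoformat(r["watched_at"])} for r in rows]


if __name__ == "__main__":
    now = datetime(2025, 3, 1, tzinfo=timezone.utc)
    history = [
        {"title": "recent show", "watched_at": now - timedelta(days=2)},
        {"title": "old show", "watched_at": now - timedelta(days=400)},
    ]
    live, old = split_by_age(history, now)
    print(len(live), "live record(s),", len(compress_records(old)), "compressed bytes")
```

Recent records stay uncompressed so that writes and reads on fresh data remain cheap, while older records are read rarely enough that the decompression cost is acceptable.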
3
Article
Community Picks·1y
BigDataBoutique/awesome-opensearch: A curated list of links and resources all about Opensearch. Maintained by the Opensearch experts at BigData Boutique (makers of Pulse for Opensearch)
The 'awesome-opensearch' resource collection is maintained by BigData Boutique. It gathers links, tools, and articles related to OpenSearch, including official documentation, community forums, migration guides, and cost optimization tips. Contributions to the repository are encouraged, and guidelines are provided for adding content.
37
1
4
Article
Flink·1y
Apache Flink 2.0.0: A new Era of Real-Time Data Processing
Apache Flink 2.0.0 is the first release in the Flink 2.x series, introducing new features and architectural enhancements for real-time data processing. Key highlights include Disaggregated State Management, Materialized Tables, and deep integration with Apache Paimon for streaming lakehouse architectures. The release focuses on improving performance, scalability, and resource efficiency, making real-time computing more accessible and practical for diverse use cases. It also includes a new DataStream V2 API and removes several deprecated APIs, so it is not backward compatible with 1.x.
30
5
Article
DuckDB·1y
Preview: Amazon S3 Tables in DuckDB
DuckDB announces a new preview feature that supports Apache Iceberg REST Catalogs, enabling easy connection to Amazon S3 Tables and Amazon SageMaker Lakehouse. It allows DuckDB users to read and query Iceberg tables directly from these platforms. The guide provides detailed steps for installing the necessary extensions from the core_nightly repository and setting up S3 table buckets. The feature is currently experimental, and a stable release is expected later in the year.
24
6
Article
Data Engineer Things·1y
Diving into Data LakeHouse: Databricks 101
Databricks offers an advanced Data Lakehouse solution, integrating the capabilities of Data Lakes and Data Warehouses. Built on Apache Spark, Databricks enables efficient data processing, data analysis, machine learning, and business intelligence. The platform leverages Delta Lake for reliable and consistent data storage, along with Unity Catalog for centralized access control and data management.
13
7
Article
DuckDB·1y
Parquet Bloom Filters in DuckDB
DuckDB now supports reading and writing Parquet Bloom filters, compact index structures that let a query skip parts of a file that cannot contain the values it is looking for. The new feature is transparent to users and significantly improves query performance, especially in scenarios with large Parquet files or slow network connections. Bloom filters are supported for various data types, including integers, floating-point numbers, and strings, but not yet for nested types.
10
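To make the Bloom filter summary above concrete, here is a minimal sketch of a textbook Bloom filter used to decide which row groups need scanning. This is not DuckDB's or Parquet's implementation (the Parquet specification uses split-block filters); the class, the sizes, and the `row_groups_to_scan` helper are hypothetical names for illustration.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, value: str):
        # Derive several bit positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))


def row_groups_to_scan(filters, value):
    """Return indexes of row groups whose filter cannot rule out the value."""
    return [i for i, f in enumerate(filters) if f.might_contain(value)]


if __name__ == "__main__":
    groups = [["alice", "bob"], ["carol", "dave"]]
    filters = []
    for group in groups:
        f = BloomFilter()
        for name in group:
            f.add(name)
        filters.append(f)
    print(row_groups_to_scan(filters, "carol"))
```

A negative answer from `might_contain` is definitive, so a reader can skip that row group without fetching it. A positive answer only means the value may be present, so the row group still has to be read and checked.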
