Notion's data grew 10x in three years, which pushed the team to build and scale a dedicated data lake. The earlier architecture, built around a complex sharded Postgres infrastructure, ran into problems with operability, speed, and cost. In response, Notion built an in-house data lake that uses AWS S3 for storage and Apache Spark for processing, fed by a Kafka-based ingestion system with Debezium CDC connectors. The new setup shortened data ingestion times, reduced costs, and supports the company's AI and analytics needs. It handles update-heavy block data and complex data transformations, and works for both small and large-scale data operations.
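
To make the update-heavy CDC ingestion above more concrete, here is a minimal Python sketch of folding change events into a snapshot so that only the latest state of each row survives. It is an illustration only, not Notion's pipeline: the function name `apply_change_events`, the `id` key, and the sample rows are hypothetical, and the events are a heavily simplified version of the Debezium envelope (`op`, `before`, `after`, `ts_ms`).

```python
from typing import Any

Row = dict[str, Any]


def apply_change_events(
    snapshot: dict[str, Row], events: list[dict[str, Any]]
) -> dict[str, Row]:
    """Merge CDC events into a snapshot keyed by row id (hypothetical helper)."""
    merged = dict(snapshot)  # copy so the caller's snapshot is not mutated
    # Apply events in timestamp order so later changes win over earlier ones.
    for event in sorted(events, key=lambda e: e["ts_ms"]):
        op = event["op"]
        if op in ("c", "u", "r"):  # create, update, snapshot read
            row = event["after"]
            merged[row["id"]] = row
        elif op == "d":  # delete
            merged.pop(event["before"]["id"], None)
    return merged


# Sample data, assumed for illustration only.
snapshot = {"b1": {"id": "b1", "text": "hello"}}
events = [
    {"op": "u", "ts_ms": 2,
     "before": {"id": "b1", "text": "hello"},
     "after": {"id": "b1", "text": "hello world"}},
    {"op": "c", "ts_ms": 1,
     "before": None,
     "after": {"id": "b2", "text": "new block"}},
    {"op": "d", "ts_ms": 3,
     "before": {"id": "b2", "text": "new block"},
     "after": None},
]
result = apply_change_events(snapshot, events)
print(result)
```

In a real data lake this merge would run as a distributed job over files in object storage rather than over an in-memory dictionary; the sketch only shows the upsert-and-delete logic that makes update-heavy data expensive to ingest.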

14m read time, from notion.so
Table of contents
Notion’s data model and growth
Notion’s data warehouse architecture in 2021
Building and scaling Notion’s in-house data lake
Scaling and operating our data lake
The payoff: less money, more time, stronger infrastructure for AI
