How Notion built and grew its data lake to keep up with rapid growth

Hacker News

Notion's data has grown 10x in three years, which pushed the company to build and scale a dedicated data lake. The earlier architecture, built on a complex sharded Postgres infrastructure, ran into problems with operability, speed, and cost. To address them, Notion developed an in-house data lake that uses AWS S3 for storage and Apache Spark for processing, fed by a Kafka-based ingestion system with Debezium CDC connectors. The new setup shortened data ingestion times, reduced costs, and supports Notion's AI and analytics needs. It handles update-heavy block data and allows complex data transformations, at both small and large scale.
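
To make the "update-heavy" point concrete, here is a minimal sketch of how a stream of change events can be folded into a keyed snapshot. It is not Notion's code: the real pipeline runs on Spark over S3, while this is plain Python. The event shape (`op`, `before`, `after`, `ts_ms`) follows the Debezium change-event envelope; the function name `apply_change_events`, the `id` key, and the sample block rows are assumptions made up for illustration.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def apply_change_events(
    snapshot: Dict[str, Row], events: Iterable[Dict[str, Any]]
) -> Dict[str, Row]:
    """Fold Debezium-style change events into a snapshot keyed by row id.

    Hypothetical helper: assumes every row carries an "id" field and every
    event carries "op", "before", "after", and "ts_ms".
    """
    result = dict(snapshot)  # shallow copy so the input snapshot is untouched
    # Apply events in timestamp order; sorted() is stable, so events with
    # equal timestamps keep their arrival order.
    for event in sorted(events, key=lambda e: e["ts_ms"]):
        op = event["op"]
        if op in ("c", "r", "u"):  # create, snapshot read, update: upsert
            row = event["after"]
            result[row["id"]] = row
        elif op == "d":  # delete: drop the row if it is present
            result.pop(event["before"]["id"], None)
    return result


# Illustrative data only: one existing block, then an out-of-order batch.
snapshot = {"b1": {"id": "b1", "text": "hello"}}
events = [
    {"op": "u", "ts_ms": 2,
     "before": {"id": "b1", "text": "hello"},
     "after": {"id": "b1", "text": "hello world"}},
    {"op": "c", "ts_ms": 1,
     "before": None,
     "after": {"id": "b2", "text": "new block"}},
    {"op": "d", "ts_ms": 3,
     "before": {"id": "b2", "text": "new block"},
     "after": None},
]
latest = apply_change_events(snapshot, events)
print(latest)
```

Because most events are updates to rows that already exist, the merge step is an upsert by key rather than an append, which is why an update-heavy workload shapes the choice of ingestion and storage design.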

Building and scaling Notion’s data lake

Notion’s data warehouse architecture in 2021

Building and scaling Notion’s in-house data lake

The payoff: less money, more time, stronger infrastructure for AI