A practical exploration of using PostgreSQL with the pg_search extension for full-text search over massive web datasets. The experiment indexed 596GB of web data from the Common Crawl Corpus (365 million documents, 156 billion words) on a single Neon instance with 8 vCPUs and 32GB RAM. pg_search handled the full dataset, but search queries took anywhere from several seconds to minutes because the index far exceeded available memory. On a 10% subset that did fit in memory, most queries completed in under one second, demonstrating pg_search's viability for large-scale applications when the instance is sized so the working set fits in RAM.
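The workflow described above reduces to a few SQL statements. A minimal sketch of how a pg_search BM25 index is typically created and queried, assuming a `documents` table with `id` and `content` columns (the table and column names are illustrative; the summary does not give the actual schema):

```sql
-- Enable the pg_search extension (must be installed on the server first)
CREATE EXTENSION IF NOT EXISTS pg_search;

-- Build a BM25 full-text index over the document text;
-- key_field identifies the unique column pg_search uses as the document key
CREATE INDEX documents_search_idx ON documents
USING bm25 (id, content)
WITH (key_field = 'id');

-- Query with pg_search's @@@ full-text operator, ranked by BM25 relevance
SELECT id, paradedb.score(id) AS score
FROM documents
WHERE content @@@ 'common crawl'
ORDER BY score DESC
LIMIT 10;
```

Index build time and query latency on a corpus of this size depend heavily on whether the index fits in memory, which is the central finding of the experiment.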
