This is a complete Hacker News archive dataset on Hugging Face, covering every story, comment, job, poll, and Ask HN post from October 2006 to the present: over 47 million items. The dataset is stored as monthly Parquet files with Zstandard compression and is updated every 5 minutes by a live pipeline built in Go using DuckDB and the HN Firebase API. It can be queried directly with DuckDB via hf:// paths, streamed with the Python datasets library, or downloaded in bulk. Detailed usage examples cover top-scored stories, submission trends, domain analysis, and more. The dataset is intended for LLM training, trend analysis, community research, and recommendation system development.
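As a rough illustration of the DuckDB access route, the sketch below builds an hf:// path for one monthly file and a query for its top-scored stories. The repository id and the file layout (`data/YYYY/YYYY-MM.parquet`) are hypothetical placeholders, not taken from this card, and the column names are assumed to follow the HN Firebase API item fields (`type`, `title`, `score`); check the dataset structure section for the real values before running it.

```python
# Hypothetical placeholders: the repository id and file layout are assumptions,
# not values from the dataset card. Replace them with the real ones.
REPO_ID = "<user>/<dataset>"
FILE_PATTERN = "data/{year:04d}/{year:04d}-{month:02d}.parquet"


def month_path(year: int, month: int, repo_id: str = REPO_ID) -> str:
    """Build an hf:// path for one monthly Parquet file (assumed layout)."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    return f"hf://datasets/{repo_id}/" + FILE_PATTERN.format(year=year, month=month)


def top_stories_sql(path: str, limit: int = 10) -> str:
    """Return SQL selecting the highest-scored stories in one file.

    Column names (type, title, score) are assumed to match the HN Firebase
    API item fields; the dataset's actual schema may differ.
    """
    return (
        "SELECT title, score "
        f"FROM read_parquet('{path}') "
        "WHERE type = 'story' "
        "ORDER BY score DESC "
        f"LIMIT {int(limit)}"
    )


def run_query(sql: str):
    """Run the query with DuckDB (needs `pip install duckdb` and network access)."""
    import duckdb  # imported lazily so the helpers above work without it

    return duckdb.sql(sql).fetchall()
```

Once the placeholders are filled in, `run_query(top_stories_sql(month_path(2006, 10)))` would fetch only the one monthly file it names, which keeps the download small compared with scanning the full archive.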
Table of contents
What is it?
What is being released?
Breakdown by today
Breakdown by year
How to download and use this dataset
Dataset statistics
Content breakdown
How it works
Thanks
Dataset summary
Dataset structure
Dataset creation
Considerations for using the data
Additional information