LumberChunker is a document segmentation method that uses an LLM to detect semantic boundaries in long-form narrative text, producing more coherent chunks for RAG pipelines. Rather than splitting on fixed token windows or structural cues, it feeds rolling groups of paragraphs (around 550 tokens each) to a language model and asks it to identify the earliest point where the content meaningfully shifts. Evaluated on GutenQA, a new benchmark of 100 public-domain books with 3,000 retrieval questions, LumberChunker achieves DCG ≈ 62.1% and Recall ≈ 77.9% at k=20, outperforming semantic, recursive, paragraph-level, and proposition-level chunking methods. Downstream QA results show that focused retrieval with LumberChunker beats large-context non-retrieval setups and approaches the quality of human-curated segmentation.
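The rolling-window procedure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The names `lumber_chunk`, `find_shift`, `count_tokens`, and `toy_find_shift` are hypothetical. The whitespace token count stands in for a real tokenizer, and the LLM call is replaced by a plain callback so the example runs without any API. How the window is restarted after each detected shift is also an assumption.

```python
from typing import Callable, List, Sequence


def count_tokens(text: str) -> int:
    """Crude whitespace token count (assumption: a real tokenizer would differ)."""
    return len(text.split())


def lumber_chunk(
    paragraphs: Sequence[str],
    find_shift: Callable[[Sequence[str]], int],
    token_budget: int = 550,
) -> List[str]:
    """Split paragraphs into chunks at model-detected content shifts.

    find_shift receives a window of paragraphs and returns the index (within
    the window) of the first paragraph where the content shifts. In a real
    system this callback would wrap an LLM prompt.
    """
    chunks: List[str] = []
    start = 0
    while start < len(paragraphs):
        # Grow the window paragraph by paragraph until the token budget is reached.
        end, tokens = start, 0
        while end < len(paragraphs) and tokens < token_budget:
            tokens += count_tokens(paragraphs[end])
            end += 1
        window = paragraphs[start:end]

        if len(window) == 1:
            split = 1  # nothing to compare against; emit the paragraph as-is
        else:
            # Clamp the answer so every iteration emits at least one paragraph.
            split = max(1, min(find_shift(window), len(window)))

        chunks.append("\n\n".join(window[:split]))
        start += split  # the next window begins at the detected shift
    return chunks


def toy_find_shift(window: Sequence[str]) -> int:
    """Stand-in for the LLM: a shift occurs when the first word changes."""
    topic = window[0].split()[0]
    for i, paragraph in enumerate(window):
        if paragraph.split()[0] != topic:
            return i
    return len(window)  # no shift found: keep the whole window together


demo = ["A one", "A two", "B one", "B two", "C one"]
chunks = lumber_chunk(demo, toy_find_shift, token_budget=6)
```

Passing the boundary detector in as a callback keeps the windowing logic testable on its own. Swapping `toy_find_shift` for a function that prompts a language model would not change the loop.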

From blog.ml.cmu.edu
Table of contents
Introduction
The LumberChunker Method
GutenQA: A Benchmark for Long-Form Narrative Retrieval
Key Findings
Conclusion
Citation
