LumberChunker is a document segmentation method that uses an LLM to detect semantic boundaries in long-form narrative text, producing more coherent chunks for RAG pipelines. Instead of relying on fixed token windows or structural cues, it feeds rolling groups of paragraphs (around 550 tokens) to a language model and asks it to identify where the semantic boundary falls.
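The rolling-group idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `chunk_paragraphs`, `pick_boundary`, and `approx_tokens` are hypothetical, the whitespace word count is a crude stand-in for a real tokenizer, and `pick_boundary` is a placeholder for the LLM call, assumed here to return the index within the group at which new content begins.

```python
from typing import Callable, List


def approx_tokens(text: str) -> int:
    # Assumption: whitespace word count as a rough proxy for token count.
    return len(text.split())


def chunk_paragraphs(
    paragraphs: List[str],
    pick_boundary: Callable[[List[str]], int],
    token_budget: int = 550,
) -> List[List[str]]:
    """Split paragraphs into chunks using a boundary-picking callback.

    pick_boundary stands in for the LLM: given a group of paragraphs, it
    returns the index (within the group) of the first paragraph that
    belongs to the next chunk.
    """
    chunks: List[List[str]] = []
    start = 0
    n = len(paragraphs)
    while start < n:
        # Grow the group until its size first exceeds the token budget.
        end, total = start, 0
        while end < n and total <= token_budget:
            total += approx_tokens(paragraphs[end])
            end += 1
        group = paragraphs[start:end]

        # If the remaining text fits in the budget, emit it as the last chunk.
        if end >= n and total <= token_budget:
            chunks.append(group)
            break

        # Ask the (stubbed) model where the content shifts, then clamp the
        # answer so every chunk has at least one paragraph and the loop
        # always makes progress.
        offset = pick_boundary(group)
        offset = max(1, min(offset, max(1, len(group) - 1)))
        chunks.append(paragraphs[start:start + offset])

        # The next group starts at the detected boundary.
        start += offset
    return chunks
```

In practice `pick_boundary` would wrap a prompt to a language model and parse its answer; keeping it as a callback lets the grouping logic be tested without any model access.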

7 min read · From blog.ml.cmu.edu
Table of contents
Introduction
The LumberChunker Method
GutenQA: A Benchmark for Long-Form Narrative Retrieval
Key Findings
Conclusion
Citation
