LumberChunker is a document segmentation method that uses an LLM to detect semantic boundaries in long-form narrative text, producing more coherent chunks for RAG pipelines. Instead of fixed token windows or structural cues, it feeds rolling groups of paragraphs (around 550 tokens) to a language model and asks it to identify
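
As a rough illustration of the rolling-group idea described above, here is a minimal sketch. It is not the authors' implementation. It assumes the model is asked to return the position inside the group where the content shifts; `find_shift` is a hypothetical stand-in for that LLM call, and `approx_tokens` is a crude word count used in place of a real tokenizer.

```python
from typing import Callable, List


def approx_tokens(text: str) -> int:
    # Assumption: whitespace word count as a crude proxy for a tokenizer.
    return len(text.split())


def build_group(paragraphs: List[str], start: int, budget: int = 550) -> int:
    """Add consecutive paragraphs from `start` until the running token
    count reaches `budget`. Returns the exclusive end index of the group."""
    total = 0
    end = start
    while end < len(paragraphs) and total < budget:
        total += approx_tokens(paragraphs[end])
        end += 1
    return end


def chunk_document(
    paragraphs: List[str],
    find_shift: Callable[[List[str]], int],  # hypothetical LLM wrapper
    budget: int = 550,
) -> List[str]:
    """Split paragraphs into chunks at the boundaries chosen by `find_shift`.

    `find_shift(group)` is assumed to return the index, relative to the
    group, of the first paragraph that belongs to the next chunk.
    """
    chunks: List[str] = []
    start = 0
    while start < len(paragraphs):
        end = build_group(paragraphs, start, budget)
        group = paragraphs[start:end]
        if len(group) < 2:
            # A single paragraph cannot be split any further.
            split = end
        else:
            rel = find_shift(group)
            # Clamp so every iteration consumes at least one paragraph.
            rel = min(max(rel, 1), len(group))
            split = start + rel
        chunks.append("\n\n".join(paragraphs[start:split]))
        start = split  # the next group starts at the detected boundary
    return chunks
```

In use, `find_shift` would wrap a prompt that lists the paragraphs in the group with their indices and parses the index the model returns. The next group starts at the detected boundary, so chunk sizes follow the content and not a fixed window.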
Table of contents
Introduction
The LumberChunker Method
GutenQA: A Benchmark for Long-Form Narrative Retrieval
Key Findings
Conclusion
Citation