MLCMU's platform is dedicated to providing resources for machine learning researchers and practitioners. Through articles, research papers, and tutorials, MLCMU offers insights into machine learning algorithms, deep learning models, and AI applications. Readers can learn about research projects, experimental methodologies, and real-world applications of machine learning to advance their knowledge and skills in the field.

LumberChunker is a document segmentation method that uses an LLM to detect semantic boundaries in long-form narrative text, producing more coherent chunks for retrieval-augmented generation (RAG) pipelines. Instead of relying on fixed token windows or structural cues, it feeds rolling groups of paragraphs (around 550 tokens each) to a language model and asks it to identify the earliest point where the content meaningfully shifts. On GutenQA, a new benchmark of 100 public-domain books with 3,000 retrieval questions, LumberChunker achieves DCG ≈ 62.1% and Recall ≈ 77.9% at k=20, outperforming semantic, recursive, paragraph-level, and proposition-level chunking methods. Downstream QA results show that focused retrieval with LumberChunker beats large-context non-retrieval setups and approaches human-curated segmentation quality.
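The rolling-group idea described above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the authors' implementation: `lumber_chunk`, `pick_shift`, and `count_tokens` are hypothetical names, token counts are approximated by whitespace-separated words, and the rule for closing out the tail of the document is a simplification chosen for this sketch. The `pick_shift` callable stands in for the LLM call and is expected to return the position, within the current group, of the first paragraph where the content shifts.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude token proxy (assumption): count whitespace-separated words."""
    return len(text.split())


def lumber_chunk(
    paragraphs: List[str],
    pick_shift: Callable[[List[str]], int],
    threshold: int = 550,
) -> List[str]:
    """Split paragraphs into chunks at boundaries chosen by `pick_shift`.

    `pick_shift(group)` is a stand-in for the LLM: it receives the current
    group of paragraphs and returns the 0-based index of the paragraph
    that starts new content.
    """
    chunks: List[str] = []
    start, n = 0, len(paragraphs)

    while start < n:
        # Grow the group until it reaches the token threshold and holds
        # at least two paragraphs (so a split point can exist).
        end, total = start, 0
        while end < n and (total < threshold or end - start < 2):
            total += count_tokens(paragraphs[end])
            end += 1
        group = paragraphs[start:end]

        # Simplification for this sketch: a short tail that never reaches
        # the threshold, or a single leftover paragraph, becomes the
        # final chunk without asking the model.
        if len(group) < 2 or (end == n and total < threshold):
            chunks.append("\n\n".join(group))
            break

        # Clamp the answer so every chunk is non-empty and the loop
        # always advances by at least one paragraph.
        k = max(1, min(pick_shift(group), len(group) - 1))
        chunks.append("\n\n".join(group[:k]))

        # The next group starts at the paragraph where the shift occurred.
        start += k

    return chunks
```

In practice, `pick_shift` would wrap a prompt that lists the group's paragraphs with IDs and parses the ID the model returns; clamping its answer guards against malformed or out-of-range replies. A real tokenizer would replace `count_tokens` if the threshold needs to match a specific model's token counts.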

Machine Learning Blog | ML@CMU | Carnegie Mellon University

GutenQA: A Benchmark for Long-Form Narrative Retrieval