Towards Data Science is a community-powered publication that showcases work in data science, machine learning and artificial intelligence. Every day newcomers, seasoned researchers and industry practitioners publish tutorials, research notes and real-world case studies that help the field move forward.

An experimental comparison of two RAG pipelines, one with query optimization and neighbor expansion and one without, across three datasets: clean corpus questions, messy corpus questions, and random real-world questions. The complex pipeline showed minimal advantage on clean synthetic questions but significantly outperformed the naive one on diffuse, multi-faceted queries by reducing fabrication, at the price of a 41% increase in cost and a 49% increase in latency. The two pipelines also failed differently: the naive pipeline failed by omission (hallucinating missing information), while the complex pipeline failed by inflation (over-synthesizing across sources). Query optimization helped 38% of questions but hurt 27%, which suggests it needs careful tuning. Most of the cost comes from the reranker, not from the added context.

When Does Adding Fancy RAG Features Work?