A structured RAG benchmarking workflow using MLflow that systematically tunes vector search configurations one knob at a time. Three benchmark rounds are run: comparing embedding models (gte-large-en vs. bge-large-en), chunk sizes (256/512/1024 tokens), and search modes (ANN vs. hybrid vs. hybrid+reranker). Each round uses the same ground-truth benchmark set and MLflow-tracked metrics to pick a winner before moving on to the next knob.
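To make the rounds comparable, a harness like this typically scores each configuration with the same retrieval metrics over the ground-truth set. A minimal sketch of that evaluation core is below; the function names, sample data, and metric choices (hit rate and mean reciprocal rank) are illustrative assumptions, not taken from the article, and in the real workflow each value would be logged per run with MLflow (e.g. `mlflow.log_metric`):

```python
# Illustrative evaluation core for a RAG benchmark harness (assumed names).
# In the actual workflow, each metric would be recorded per configuration,
# e.g. mlflow.log_metric("hit_rate", value), so rounds can be compared.

def hit_rate(retrieved: list[list[str]], relevant: list[str], k: int = 5) -> float:
    """Fraction of queries whose known-relevant chunk appears in the top-k results."""
    hits = sum(rel in docs[:k] for docs, rel in zip(retrieved, relevant))
    return hits / len(relevant)

def mrr(retrieved: list[list[str]], relevant: list[str]) -> float:
    """Mean reciprocal rank of the relevant chunk (0 when it is not retrieved)."""
    total = 0.0
    for docs, rel in zip(retrieved, relevant):
        if rel in docs:
            total += 1.0 / (docs.index(rel) + 1)
    return total / len(relevant)

# Hypothetical ground-truth benchmark: one known-relevant chunk id per query.
retrieved = [["c3", "c1", "c9"], ["c2", "c7", "c4"], ["c8", "c5", "c6"]]
relevant = ["c1", "c2", "c0"]

print(hit_rate(retrieved, relevant, k=3))  # 2 of 3 queries hit
print(mrr(retrieved, relevant))  # → 0.5
```

The same two functions are reused unchanged across every round, so differences on the leaderboard reflect only the knob being tuned.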
Table of contents
- High-level summary: problems, approaches, and takeaways for better RAG with MLflow
- The Benchmark Plan
- Step 1: Build a Ground-Truth Benchmark Set
- Step 2: Write a Retriever for Each Vector Store
- Step 3: Build the Benchmark Harness
- Round 1: Benchmarking Embedding Models
- Round 2: Benchmarking Chunk Size
- Round 3: Benchmarking Search Modes
- Step 4: Read the Leaderboard
- What We Learned
- What's Next?
- Resources and References