A structured RAG benchmarking workflow using MLflow that systematically tunes vector search configurations one knob at a time. Three benchmark rounds are run: comparing embedding models (gte-large-en vs. bge-large-en), chunk sizes (256/512/1024 tokens), and search modes (ANN vs. hybrid vs. hybrid+reranker). Each round uses the same ground-truth benchmark set and MLflow-tracked metrics to pick a winner before moving on to the next knob.
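To make the rounds comparable, a harness like this typically scores each configuration with the same retrieval metrics over the ground-truth set. A minimal sketch of that evaluation core is below; the function names, sample data, and metric choices (hit rate and mean reciprocal rank) are illustrative assumptions, not taken from the article, and in the real workflow each value would be logged per run with MLflow (e.g. `mlflow.log_metric`):

```python
# Illustrative evaluation core for a RAG benchmark harness (assumed names).
# In the actual workflow, each metric would be recorded per configuration,
# e.g. mlflow.log_metric("hit_rate", value), so rounds can be compared.

def hit_rate(retrieved: list[list[str]], relevant: list[str], k: int = 5) -> float:
    """Fraction of queries whose known-relevant chunk appears in the top-k results."""
    hits = sum(rel in docs[:k] for docs, rel in zip(retrieved, relevant))
    return hits / len(relevant)

def mrr(retrieved: list[list[str]], relevant: list[str]) -> float:
    """Mean reciprocal rank of the relevant chunk (0 when it is not retrieved)."""
    total = 0.0
    for docs, rel in zip(retrieved, relevant):
        if rel in docs:
            total += 1.0 / (docs.index(rel) + 1)
    return total / len(relevant)

# Hypothetical ground-truth benchmark: one known-relevant chunk id per query.
retrieved = [["c3", "c1", "c9"], ["c2", "c7", "c4"], ["c8", "c5", "c6"]]
relevant = ["c1", "c2", "c0"]

print(hit_rate(retrieved, relevant, k=3))  # 2 of 3 queries hit
print(mrr(retrieved, relevant))  # → 0.5
```

The same two functions are reused unchanged across every round, so differences on the leaderboard reflect only the knob being tuned.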
Table of contents
- High-level summary: problems, approaches, and takeaways for better RAG with MLflow
- The Benchmark Plan
- Step 1: Build a Ground-Truth Benchmark Set
- Step 2: Write a Retriever for Each Vector Store
- Step 3: Build the Benchmark Harness
- Round 1: Benchmarking Embedding Models
- Round 2: Benchmarking Chunk Size
- Round 3: Benchmarking Search Modes
- Step 4: Read the Leaderboard
- What We Learned
- What's Next?
- Resources and References