BM25 is having a moment. We reproduce Doug Turnbull’s MSMARCO autoresearch experiment in Vespa and get a comparable MRR@10 lift using only existing rank features, with twice the generalization to full MSMARCO.

Vespa Blog

A reproduction and extension of Doug Turnbull's LLM-driven autoresearch experiment on MSMARCO BM25 ranking, implemented in Vespa. Instead of letting an LLM write arbitrary Python reranking code, the authors constrain the search space to existing Vespa rank features. Three changes account for the improvement: an aggressive weakAnd stopword limit (0.05), nativeProximity scoring, and fieldMatch.earliness. The manual sweep reaches an MRR@10 of 0.5163 on the minimarco subset (+0.026 over BM25), and 80% of that gain transfers to the full 8.84M-document MSMARCO corpus, compared with only 21% for the free-form Python agent. An autonomous LLM agent constrained to Vespa rank features retains 99% of its gain. The key insight is that constraining the search space to validated, generalizable rank features guards against overfitting to the evaluation subset.
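
The retention percentages above can be read as the share of the subset gain that survives on the full corpus. The snippet below is a minimal sketch of that arithmetic, assuming retention is defined as the full-corpus MRR@10 gain divided by the minimarco gain (the summary does not spell out the formula). The function name `gain_retention` is illustrative, and the full-corpus baseline is a hypothetical placeholder, not a reported result.

```python
def gain_retention(subset_baseline, subset_tuned, full_baseline, full_tuned):
    """Fraction of the subset MRR@10 gain that carries over to the full corpus.

    Assumed definition: (full_tuned - full_baseline) / (subset_tuned - subset_baseline).
    """
    subset_gain = subset_tuned - subset_baseline
    if subset_gain <= 0:
        raise ValueError("retention is only meaningful for a positive subset gain")
    return (full_tuned - full_baseline) / subset_gain


# Figures taken from the summary: tuned minimarco MRR@10 and its gain over BM25.
SUBSET_TUNED = 0.5163
SUBSET_GAIN = 0.026
subset_baseline = SUBSET_TUNED - SUBSET_GAIN  # implied BM25 score on minimarco

# 80% retention implies this absolute MRR@10 gain on the full corpus.
implied_full_gain = 0.80 * SUBSET_GAIN

# HYPOTHETICAL full-corpus BM25 baseline, used only to make the example runnable.
HYPOTHETICAL_FULL_BASELINE = 0.18

retention = gain_retention(
    subset_baseline,
    SUBSET_TUNED,
    HYPOTHETICAL_FULL_BASELINE,
    HYPOTHETICAL_FULL_BASELINE + implied_full_gain,
)
print(f"implied full-corpus gain: {implied_full_gain:.4f}")
print(f"retention: {retention:.0%}")
```

Under this definition only the two gains matter, so the placeholder baseline does not affect the result. By the same arithmetic, 21% retention of a 0.026 subset gain would be a full-corpus gain of roughly 0.005, although the free-form agent's own subset gain may differ from 0.026.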

Re-autoresearching MSMARCO BM25, on Vespa