A reproduction and extension of Doug Turnbull's LLM-driven autoresearch experiment on MSMARCO BM25 ranking, implemented in Vespa. Instead of letting an LLM write arbitrary Python reranking code, the authors constrain the search space to existing Vespa rank features. Three changes account for the improvement: an aggressive weakAnd stopword limit (0.05), nativeProximity scoring, and fieldMatch.earliness. The manual sweep reaches an MRR@10 of 0.5163 on the minimarco subset (+0.026 over BM25), and 80% of that gain transfers to the full 8.84M-document MSMARCO corpus, compared with only 21% retention for the free-form Python agent. An autonomous LLM agent constrained to Vespa rank features achieves 99% retention. The main takeaway is that constraining the search space to validated, generalizable rank features prevents overfitting to the evaluation subset.
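The two quantities quoted above, MRR@10 and gain retention, can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function names, the toy rankings, and the toy scores are hypothetical, and the retention formula (full-corpus gain divided by subset gain, both measured against the BM25 baseline) is an assumption about how the percentages were derived.

```python
from typing import Dict, List, Set


def mrr_at_k(
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Mean reciprocal rank of the first relevant document in the top k.

    A query with no relevant document in its top k contributes 0.
    """
    if not rankings:
        return 0.0
    total = 0.0
    for query_id, ranked_doc_ids in rankings.items():
        judged = relevant.get(query_id, set())
        for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
            if doc_id in judged:
                total += 1.0 / rank
                break
    return total / len(rankings)


def gain_retention(
    subset_baseline: float,
    subset_tuned: float,
    full_baseline: float,
    full_tuned: float,
) -> float:
    """Fraction of the subset gain that survives on the full corpus.

    Assumed definition: (full gain) / (subset gain).
    """
    subset_gain = subset_tuned - subset_baseline
    if subset_gain == 0:
        raise ValueError("subset gain is zero; retention is undefined")
    return (full_tuned - full_baseline) / subset_gain


# Toy data, invented purely for illustration.
toy_rankings = {
    "q1": ["d3", "d1", "d7"],  # first relevant hit at rank 2
    "q2": ["d2", "d9"],        # first relevant hit at rank 1
    "q3": ["d5", "d6"],        # no relevant hit in the top 10
}
toy_relevant = {"q1": {"d1"}, "q2": {"d2"}, "q3": {"d8"}}

toy_mrr = mrr_at_k(toy_rankings, toy_relevant)
toy_retention = gain_retention(0.40, 0.50, 0.30, 0.38)
print(f"MRR@10 = {toy_mrr:.4f}, retention = {toy_retention:.0%}")
```

Under this definition, a retention close to 100% means a tweak helps the full corpus about as much as it helped the subset, while a low value suggests the tweak was fitted to the subset.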
Table of contents
BM25 is having a moment
Building on BM25
Reproducing Doug’s setup
The three tweaks that moved the needle
What happens on full MSMARCO?
Our own “autoresearch” loop
Our final config
Going further
Notes