An experiment using an AI coding agent to iteratively improve BM25 search ranking on the MSMarco passage retrieval dataset. The agent starts with a baseline BM25 implementation and proposes code changes, accepting only those that improve NDCG on validation data. After 8 rounds, the agent had discovered stopword removal for longer queries and a bigram phrase boost, reaching an MRR near 0.2. Gains then plateaued because the agent overfit to the minimarco sample: its stopword list picked up odd entries like 'medicine' and 'vacation' that leaked from the validation set. The post reflects on lessons learned about data leakage in automated tuning and outlines future directions, including better context management and using the full dataset.
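
The accept-only-if-it-improves loop described in this summary can be sketched roughly as follows. This is a minimal illustration and not the post's actual code: the names `Ranker`, `mean_reciprocal_rank`, and `greedy_accept` are hypothetical, the candidate rankers stand in for the agent's proposed code changes, and MRR is used as the default metric only for brevity (the summary says acceptance was judged on NDCG).

```python
from typing import Callable, List, Sequence, Set, Tuple

# Hypothetical types: a ranker maps a query string to ranked passage ids,
# and judgments pair each query with its set of relevant passage ids.
Ranker = Callable[[str], List[str]]
Judgments = Sequence[Tuple[str, Set[str]]]
Metric = Callable[[Ranker, Judgments], float]


def mean_reciprocal_rank(ranker: Ranker, judgments: Judgments, k: int = 10) -> float:
    """Average of 1/rank of the first relevant passage in the top k (0 if none)."""
    total = 0.0
    for query, relevant in judgments:
        for rank, doc_id in enumerate(ranker(query)[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(judgments) if judgments else 0.0


def greedy_accept(
    baseline: Ranker,
    candidates: Sequence[Tuple[str, Ranker]],
    judgments: Judgments,
    metric: Metric = mean_reciprocal_rank,
) -> Tuple[Ranker, float, List[str]]:
    """Keep a candidate only if it strictly beats the best score so far."""
    best, best_score = baseline, metric(baseline, judgments)
    accepted: List[str] = []
    for name, candidate in candidates:
        score = metric(candidate, judgments)
        if score > best_score:
            best, best_score = candidate, score
            accepted.append(name)
    return best, best_score, accepted
```

Because every decision is scored against the same validation judgments, a loop like this will happily keep changes that only fit that sample, which is the leakage risk the summary describes. Scoring accepted changes on a separate held-out set is one common guard.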

8m read time · From softwaredoug.com
Table of contents
The start code
The training process
The results
Still, a useful tuning tool
