Pinterest Search scaled its relevance assessment by fine-tuning an open-source multilingual LLM (XLM-RoBERTa-large) on human-annotated data to predict search result relevance. This approach reduced labeling cost and turnaround time while achieving 73.7% exact match with human labels and strong rank correlation (Kendall's τ > 0.5). By enabling stratified sampling designs over much larger query sets, it cut the minimum detectable effect from 1.3–1.5% to ≤0.25%, primarily through variance reduction. The system now evaluates A/B experiments across multiple languages and query popularity segments, producing sDCG@K metrics for ranking quality assessment.
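The sDCG@K metric mentioned above can be sketched in a few lines. The post does not spell out the exact formula, so this is a minimal sketch under common assumptions: graded relevance labels on a 0–4 scale (as LLM-predicted judgments often are) and normalization by an ideal ranking in which every slot carries the maximum label. The function names `dcg_at_k` and `sdcg_at_k` are illustrative, not from the source.

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain over the top-k results:
    # higher-graded results count more, and earlier positions
    # are discounted less (log2 position discount).
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances[:k]))

def sdcg_at_k(relevances, k, max_rel=4):
    # Scaled/normalized DCG (assumption): divide by the DCG of a
    # hypothetical ideal ranking where every top-k slot holds the
    # maximum relevance label, yielding a score in [0, 1].
    ideal = dcg_at_k([max_rel] * k, k)
    return dcg_at_k(relevances, k) / ideal if ideal else 0.0

# Example: a ranking with the best result first scores higher than
# the same labels with the best result last.
top_heavy = sdcg_at_k([4, 2, 0], k=3)
bottom_heavy = sdcg_at_k([0, 2, 4], k=3)
```

With per-result relevance predicted by the fine-tuned model, averaging sDCG@K over a stratified query sample gives the experiment-level ranking quality score described in the post.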

9 min read · From medium.com
Table of contents
- Introduction
- Methodology
- Fine-tuned LLMs as Relevance Model
- Stratified Sampling Design
- Relevance Measurement with LLMs
- Results
- Summary
- Future Work
- Acknowledgement