Spotify Research introduces a profile-aware LLM-as-a-Judge approach for evaluating podcast recommendations that bridges the gap between fast offline metrics and expensive A/B tests. The method creates human-readable user profiles from 90 days of listening history, then uses LLMs to score candidate episodes against these profiles. In a 47-user study, the approach achieved 75% alignment with human judgments and successfully differentiated between production recommendation models, offering a scalable middle ground for recommendation system evaluation.
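The flow summarized above (build a text profile from recent listening, have a judge label candidate episodes, then compare those labels with human ones) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the published pipeline: the `Listen` record, the topic and show counting in `build_profile`, the `judge` callable standing in for an LLM call, and the match-fraction definition of alignment in `alignment_rate` are all hypothetical. Only the 90-day window comes from the summary.

```python
from collections import Counter
from datetime import date, timedelta
from typing import Callable, Iterable, List, NamedTuple, Sequence


class Listen(NamedTuple):
    """One listening event (hypothetical schema, for illustration only)."""
    day: date
    show: str
    topic: str


def build_profile(history: Iterable[Listen], today: date,
                  window_days: int = 90, top_k: int = 3) -> str:
    """Summarize listens from the last `window_days` days as readable text."""
    cutoff = today - timedelta(days=window_days)
    recent = [h for h in history if cutoff <= h.day <= today]
    if not recent:
        return "No recent listening history."
    # Assumption: the profile is just the most frequent topics and shows.
    topics = Counter(h.topic for h in recent).most_common(top_k)
    shows = Counter(h.show for h in recent).most_common(top_k)
    return ("Top topics: " + ", ".join(t for t, _ in topics) + ". "
            "Top shows: " + ", ".join(s for s, _ in shows) + ".")


def judge_episodes(profile: str, episodes: Sequence[str],
                   judge: Callable[[str, str], bool]) -> List[bool]:
    """Apply a judge to each candidate episode description.

    `judge(profile, episode)` is a placeholder for an LLM call that
    returns True when the episode fits the profile.
    """
    return [judge(profile, episode) for episode in episodes]


def alignment_rate(judge_labels: Sequence[bool],
                   human_labels: Sequence[bool]) -> float:
    """Fraction of items where the judge and the human label agree."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


if __name__ == "__main__":
    history = [
        Listen(date(2025, 1, 10), "Show A", "science"),
        Listen(date(2025, 1, 12), "Show A", "science"),
        Listen(date(2025, 1, 15), "Show B", "history"),
        Listen(date(2024, 6, 1), "Show C", "sports"),  # outside the window
    ]
    profile = build_profile(history, today=date(2025, 2, 1))

    # Toy keyword judge used only so the sketch runs without an LLM.
    def keyword_judge(p: str, episode: str) -> bool:
        return any(word in p for word in episode.lower().split())

    labels = judge_episodes(profile, ["Deep science talk", "Sports recap"],
                            keyword_judge)
    print(profile)
    print(alignment_rate(labels, [True, False]))
```

In practice the keyword judge would be replaced by a prompt to an LLM that receives the profile and the episode metadata; the agreement calculation stays the same regardless of which judge is plugged in.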
Table of contents
Context
The core idea
How the pipeline works
How well does the LLM judge align with user feedback?
Richer profiles lead to better judgments
Some final words