Spotify Research introduces a profile-aware LLM-as-a-Judge approach for evaluating podcast recommendations that bridges the gap between fast offline metrics and expensive A/B tests. The method creates human-readable user profiles from 90 days of listening history, then uses LLMs to score candidate episodes against these

6m read time, from research.atspotify.com
Table of contents
Context
The core idea
How the pipeline works
How well does the LLM judge align with user feedback?
Richer profiles lead to better judgments
Some final words
