Evaluating podcast recommendations is notoriously difficult. Offline metrics are quick but biased, while A/B tests provide rigor at the cost of time and resources. To bridge this gap, we propose a profile-aware LLM-as-a-Judge: it summarizes a listener’s tastes and asks an LLM to score candidate episodes or lists against that profile. In a 47-user study, this approach achieved 75% alignment with listener judgments and highlighted meaningful differences between two production-grade models, offering a practical middle ground between offline metrics and A/B testing.

Spotify Research

Spotify Research introduces a profile-aware LLM-as-a-Judge approach for evaluating podcast recommendations that bridges the gap between fast offline metrics and expensive A/B tests. The method builds a human-readable profile of each user from 90 days of listening history, then asks an LLM to score candidate episodes against that profile. In a 47-user study, the approach achieved 75% alignment with human judgments and distinguished between two production recommendation models, offering a scalable middle ground for recommendation system evaluation.
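The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the implementation used in the study: the `Listen` record, the topic-minutes profile format, the prompt wording, and the binary fit label are all assumptions made for the example, and the LLM call itself is left as a pluggable function so that no particular provider API is implied.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Callable, Iterable, Sequence


@dataclass(frozen=True)
class Listen:
    """One listening event (hypothetical schema, for illustration only)."""
    day: date
    show: str
    topic: str
    minutes: float


def build_profile(history: Iterable[Listen], today: date,
                  window_days: int = 90, top_k: int = 3) -> str:
    """Summarize the last `window_days` of listening as a short text profile."""
    cutoff = today - timedelta(days=window_days)
    minutes_by_topic: Counter = Counter()
    for listen in history:
        # Keep only events inside the look-back window.
        if cutoff <= listen.day <= today:
            minutes_by_topic[listen.topic] += listen.minutes
    if not minutes_by_topic:
        return f"No listening activity in the last {window_days} days."
    parts = [f"{topic} ({minutes:.0f} min)"
             for topic, minutes in minutes_by_topic.most_common(top_k)]
    return (f"Over the last {window_days} days this listener mostly played: "
            + ", ".join(parts) + ".")


def judge_prompt(profile: str, episode_description: str) -> str:
    """Assemble the text sent to the judge (the wording is an assumption)."""
    return (
        "Listener profile:\n" + profile + "\n\n"
        "Candidate episode:\n" + episode_description + "\n\n"
        "Is this episode a good fit for the listener? Answer yes or no."
    )


def judge_episodes(profile: str, episodes: Sequence[str],
                   ask_llm: Callable[[str], bool]) -> list[bool]:
    """Score each candidate; `ask_llm` stands in for a real model call."""
    return [ask_llm(judge_prompt(profile, episode)) for episode in episodes]


def alignment_rate(judge_labels: Sequence[bool],
                   user_labels: Sequence[bool]) -> float:
    """Fraction of items on which the judge and the listener agree."""
    if not judge_labels or len(judge_labels) != len(user_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == u for j, u in zip(judge_labels, user_labels))
    return matches / len(judge_labels)
```

In use, `ask_llm` would wrap whichever model serves as the judge, and `alignment_rate` would be computed against the labels collected from listeners. Agreement on three of four items, for example, gives 0.75.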

Profile-aware LLM-as-a-Judge for Podcasts: A Better Middle Ground Between Offline Metrics and A/B Tests

How well does the LLM judge align with user feedback?