Spotify Research presents a hybrid approach for personalizing AI-powered music recommendations using LLM-based agentic systems. The method combines reward models with Direct Preference Optimization (DPO) to create a continuous learning flywheel that adapts to user preferences inferred from listening behavior. The system interprets natural language queries, orchestrates music search tools, and learns from user interactions such as plays, skips, and saves. Production A/B tests showed a 4% increase in listening time, more playlist saves, and a 70% reduction in erroneous tool calls while maintaining quality standards.
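
To make the pairing of a reward model with DPO concrete, here is a minimal sketch in Python. It is not Spotify's implementation. The function `toy_reward`, its weights, and the candidate fields (`plays`, `skips`, `saves`) are hypothetical stand-ins for a learned reward model. Only the DPO loss follows the standard published formulation: the negative log-sigmoid of the scaled log-probability margin between the chosen and rejected responses, measured relative to a frozen reference model.

```python
import math


def toy_reward(candidate):
    # Hypothetical stand-in for a learned reward model: the weights are
    # illustrative assumptions, not values from the article.
    return 2.0 * candidate["saves"] + candidate["plays"] - candidate["skips"]


def build_preference_pair(candidates, reward_fn=toy_reward):
    # Rank candidate responses by reward and keep the best and worst
    # as a (chosen, rejected) pair for preference tuning.
    ranked = sorted(candidates, key=reward_fn)
    return ranked[-1], ranked[0]


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Standard DPO loss for one pair: -log(sigmoid(beta * margin)), where
    # margin compares policy-vs-reference log-ratios of chosen and rejected.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # Numerically stable evaluation of log(1 + exp(-z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


if __name__ == "__main__":
    candidates = [
        {"id": "playlist_a", "plays": 5, "skips": 1, "saves": 2},
        {"id": "playlist_b", "plays": 1, "skips": 4, "saves": 0},
    ]
    chosen, rejected = build_preference_pair(candidates)
    print(chosen["id"], rejected["id"])
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

In this sketch the loss falls below log 2 once the policy prefers the chosen response more strongly than the reference model does, which is the signal each turn of such a flywheel would reinforce.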

9m read time. From research.atspotify.com
Table of contents
Limitations of traditional approaches
A hybrid approach: Reward Models + Direct Preference Optimization
The Preference Tuning Flywheel
Why reward models matter
Stable, scalable fine-tuning
Online experiments
Engineering practices that made the difference
Looking ahead
Acknowledgments
