Spotify Research presents a hybrid approach for personalizing AI-powered music recommendations using LLM-based agentic systems. The method combines reward models with Direct Preference Optimization (DPO) to create a continuous learning flywheel that adapts to user preferences from listening behavior. The system interprets
Table of contents
Limitations of traditional approaches
A hybrid approach: Reward Models + Direct Preference Optimization
The Preference Tuning Flywheel
Why reward models matter
Stable, scalable fine-tuning
Online experiments
Engineering practices that made the difference
Looking ahead
Acknowledgments