Spotify Research introduces a framework that uses Multimodal Large Language Models (MLLMs) to generate rich text descriptions from video and audio content, significantly improving video recommendation systems. The approach converts raw video frames and audio into semantically dense descriptions that capture intent, humor, and world knowledge, elements that traditional encoders miss. Testing on the MicroLens-100K dataset showed performance improvements of up to 60% when the descriptions were integrated with standard recommendation architectures such as two-tower models and SASRec, with particularly strong gains for longer videos.
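To make the idea concrete, here is a minimal sketch of how text descriptions could feed a two-tower style scorer. It is not Spotify's implementation: the descriptions are hypothetical stand-ins for MLLM output, `embed_text` is a toy bag-of-words encoder used in place of a learned text encoder, and the user tower is simply the mean of previously watched item vectors.

```python
from collections import Counter
import math


def embed_text(text):
    """Toy item tower (assumption): L2-normalised bag-of-words vector.

    A real system would use a learned text encoder here.
    """
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {tok: c / norm for tok, c in counts.items()}


def user_vector(history):
    """Toy user tower (assumption): mean of the watched items' vectors."""
    acc = Counter()
    for vec in history:
        for tok, weight in vec.items():
            acc[tok] += weight / len(history)
    return dict(acc)


def score(user, item):
    """Dot product between the user vector and an item vector."""
    return sum(weight * item.get(tok, 0.0) for tok, weight in user.items())


# Hypothetical descriptions, standing in for MLLM-generated text.
descriptions = {
    "v1": "comedian jokes about cooking pasta in tiny kitchen",
    "v2": "chef explains cooking pasta with tomato sauce",
    "v3": "guitarist plays slow blues solo on stage",
}
item_vecs = {vid: embed_text(text) for vid, text in descriptions.items()}

# A user who watched v1; rank the remaining candidates by score.
user = user_vector([item_vecs["v1"]])
ranked = sorted(["v2", "v3"], key=lambda v: score(user, item_vecs[v]), reverse=True)
print(ranked)
```

In this toy setup the cooking video ranks above the music video because its description shares words with the user's history. The framework described above relies on far richer descriptions and learned encoders, but the data flow (description, then item vector, then user-item score) has the same shape.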

6 min read. From research.atspotify.com
Table of contents: The Framework, Empirical Evaluation, Takeaways, References
