Spotify Research introduces a framework that uses Multimodal Large Language Models (MLLMs) to generate rich text descriptions from video and audio content, significantly improving video recommendation systems. The approach converts raw video frames and audio into semantically dense descriptions that capture qualities such as intent and humor.
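A minimal sketch of this idea: sampled frame captions and an audio transcript are assembled into a single prompt asking an MLLM for a recommendation-oriented description. The function name, prompt wording, and inputs here are illustrative assumptions, not Spotify's actual pipeline.

```python
# Hypothetical sketch of a multimodal-to-text step: combine per-frame
# visual captions and an audio transcript into one MLLM prompt whose
# answer would serve as the video's semantically dense description.

def build_description_prompt(frame_captions, audio_transcript):
    """Assemble visual and audio context into a single MLLM prompt."""
    frames = "\n".join(
        f"Frame {i}: {caption}" for i, caption in enumerate(frame_captions)
    )
    return (
        "Describe this video for a recommendation system. "
        "Capture the creator's intent, tone, and humor.\n"
        f"Visual context:\n{frames}\n"
        f"Audio transcript:\n{audio_transcript}"
    )

prompt = build_description_prompt(
    ["a person opens a guitar case", "close-up of hands strumming"],
    "Today I'm covering my favorite song...",
)
print(prompt)
```

In a full pipeline, this prompt would be sent to an MLLM, and the returned description indexed as a text feature for the recommender.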

From research.atspotify.com