Spotify Research introduces a framework that uses Multimodal Large Language Models (MLLMs) to generate rich text descriptions from video and audio content, significantly improving video recommendation systems. The approach converts raw video frames and audio into semantically dense descriptions that capture intent, humor, and