Recommending the right short- or long-form video on TikTok, Reels, YouTube, Spotify, and beyond remains challenging because standard video and audio encoders capture pixels and waveforms but miss intent, parody, and world knowledge: the very reasons a clip might resonate with users and drive engagement.


Spotify Research

Spotify Research introduces a framework that uses Multimodal Large Language Models (MLLMs) to generate rich text descriptions from video and audio content, which are then used as features to improve video recommendation systems. The approach converts raw video frames and audio into semantically dense descriptions that capture intent, humor, and world knowledge, elements that traditional encoders miss. In tests on the MicroLens-100K dataset, integrating these descriptions with standard recommendation architectures such as two-tower models and SASRec improved performance by up to 60%, with particularly strong gains for longer videos.
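To make the idea concrete, here is a minimal sketch of how MLLM-generated descriptions might feed a two-tower recommender. It is not the authors' implementation. Everything named in it is an assumption for illustration: `describe_clip` is a hypothetical stand-in that returns canned text instead of calling a real MLLM, `embed_text` is a toy hashed bag-of-words encoder in place of a trained text encoder, and the clip IDs and descriptions are invented.

```python
import hashlib

import numpy as np

DIM = 64  # embedding size; an arbitrary choice for this sketch

# Invented descriptions standing in for MLLM output (assumption, not real data).
DESCRIPTIONS = {
    "clip_a": "a parody of a cooking show where the chef burns toast on purpose",
    "clip_b": "a deadpan comedy sketch mocking celebrity cooking tutorials",
    "clip_c": "a slow drone shot over a foggy mountain lake at sunrise",
}


def describe_clip(clip_id: str) -> str:
    """Hypothetical stand-in for an MLLM that describes a clip's frames and audio."""
    return DESCRIPTIONS[clip_id]


def embed_text(text: str, dim: int = DIM) -> np.ndarray:
    """Toy text encoder: hashed bag-of-words, L2-normalized.

    A real system would use a trained text encoder; this keeps the sketch
    dependency-free apart from NumPy.
    """
    vec = np.zeros(dim)
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def item_tower(clip_id: str) -> np.ndarray:
    """Item side: embed the text description of the clip."""
    return embed_text(describe_clip(clip_id))


def user_tower(history_ids: list[str]) -> np.ndarray:
    """User side: mean of the embeddings of previously watched clips."""
    mean_vec = np.mean([item_tower(c) for c in history_ids], axis=0)
    norm = np.linalg.norm(mean_vec)
    return mean_vec / norm if norm > 0 else mean_vec


def rank(history_ids: list[str], candidate_ids: list[str]) -> list[tuple[str, float]]:
    """Score candidates by dot product with the user vector, highest first."""
    user_vec = user_tower(history_ids)
    scored = [(c, float(user_vec @ item_tower(c))) for c in candidate_ids]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for clip_id, score in rank(["clip_a"], ["clip_b", "clip_c"]):
        print(f"{clip_id}: {score:.3f}")
```

In this sketch the recommender never sees pixels or waveforms, only the text description. In a learned system, both towers would be trained on interaction data rather than fixed as they are here.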

Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations