TL;DR: We are excited to announce voyage-multimodal-3, a new state-of-the-art model for multimodal embeddings and a big step toward seamless RAG and semantic search for documents rich in both visuals and text. Unlike existing multimodal embedding models, voyage-multimodal-3 can vectorize interleaved text and images and capture key visual features from screenshots of…



voyage-multimodal-3 is a new state-of-the-art model for multimodal embeddings. It vectorizes interleaved text and images and captures key visual features from sources such as PDFs, slides, and tables. It is reported to outperform leading models such as OpenAI CLIP and Cohere multimodal v3 on retrieval tasks. Because it processes text and visuals within the same transformer encoder, it gives robust results on mixed-modality searches and removes the need for complex document parsing.
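To make the retrieval step concrete, here is a minimal sketch of ranking documents once every item (text, image, or screenshot) lives in one embedding space. It assumes cosine similarity as the scoring function, which the summary does not state. The three-dimensional vectors and the document names are made-up placeholders, not output from voyage-multimodal-3, and `cosine` and `rank` are hypothetical helpers written for this illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted by descending similarity."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder vectors standing in for real embeddings (assumption: in practice
# these would come from the embedding model, one vector per document).
docs = {
    "slide_screenshot": [0.9, 0.1, 0.0],
    "pdf_page": [0.1, 0.9, 0.1],
    "table_image": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # placeholder embedding of a text query

ranking = rank(query, docs)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

Because queries and documents share one vector space, the same ranking code applies whether a document started out as text, an image, or a screenshot containing both.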

voyage-multimodal-3: all-in-one embedding model for interleaved text, images, and screenshots