Multimodal Embedding & Reranker Models with Sentence Transformers

Sentence Transformers v5.4 adds multimodal support, enabling encoding and comparison of text, images, audio, and video using the same familiar API. The update introduces multimodal embedding models that map different modalities into a shared vector space, and multimodal reranker (CrossEncoder) models that score relevance between mixed-modality pairs. Key features include cross-modal similarity computation, encode_query/encode_document methods for retrieval tasks, a retrieve-and-rerank pipeline pattern, and support for multiple input formats (URLs, file paths, PIL images, numpy arrays). Supported models include Qwen3-VL-Embedding, NVIDIA Nemotron, Jina Reranker M0, and legacy CLIP models. Hardware requirements vary: VLM-based models need 8–20 GB VRAM, while CLIP models run on CPU.
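The retrieve-and-rerank pattern mentioned above can be sketched without any model at all. The following is a minimal, library-free illustration, not the Sentence Transformers API: the three-dimensional vectors stand in for embeddings that would come from encode_query/encode_document, the file names are hypothetical, and the pair scores stand in for what a multimodal reranker would return for each (query, document) pair.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, doc_vecs, top_k):
    """Stage 1: rank all documents by embedding similarity, keep the top_k."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

def rerank(candidates, pair_score):
    """Stage 2: rescore only the retrieved candidates with a pairwise scorer."""
    rescored = [(doc_id, pair_score(doc_id)) for doc_id, _ in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored

# Hypothetical embeddings in a shared space: an image, an audio clip, and a
# text file all live in the same vector space, so they are directly comparable.
doc_vecs = {
    "cat_photo.jpg": [0.9, 0.1, 0.0],
    "dog_audio.wav": [0.1, 0.9, 0.1],
    "cat_article.txt": [0.8, 0.2, 0.1],
}
query_vec = [1.0, 0.0, 0.0]  # stand-in for an encoded text query

# Hypothetical reranker outputs for the query paired with each document.
toy_pair_scores = {"cat_photo.jpg": 0.4, "cat_article.txt": 0.7, "dog_audio.wav": 0.1}

candidates = retrieve(query_vec, doc_vecs, top_k=2)
reranked = rerank(candidates, lambda doc_id: toy_pair_scores[doc_id])

print([doc_id for doc_id, _ in candidates])
print([doc_id for doc_id, _ in reranked])
```

The design point is that the cheap similarity search runs over every document, while the more expensive pairwise scorer only sees the short candidate list, and it is free to reorder that list.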

Apr 20 • 14m read time • From huggingface.co

Table of Contents

What are Multimodal Models?
Installation
Multimodal Embedding Models
Multimodal Reranker Models
Retrieve and Rerank
Input Formats and Configuration
Supported Models
Additional Resources
