Sentence Transformers v5.4 adds multimodal support, so text, images, audio, and video can be encoded and compared using the same familiar API. The update introduces multimodal embedding models, which map different modalities into a shared vector space, and multimodal reranker (CrossEncoder) models, which score relevance between mixed-modality pairs. Key features include cross-modal similarity computation, encode_query/encode_document methods for retrieval tasks, a retrieve-and-rerank pipeline pattern, and support for multiple input formats (URLs, file paths, PIL images, numpy arrays). Supported models include Qwen3-VL-Embedding, NVIDIA Nemotron, Jina Reranker M0, and legacy CLIP models. Hardware requirements vary: VLM-based models need 8–20 GB of GPU VRAM, while CLIP models can run on CPU.
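The retrieve-and-rerank pattern mentioned above can be sketched without loading any model. This is a minimal illustration rather than the library's implementation: `retrieve_and_rerank`, `cosine`, and the toy lookup tables are hypothetical names invented for this example, and the three callables stand in for whatever embedding model and reranker are actually used. Plain strings such as "cat.jpg" are placeholders for real inputs of any modality.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_and_rerank(
    query: str,
    documents: List[str],
    embed_query: Callable[[str], Sequence[float]],
    embed_document: Callable[[str], Sequence[float]],
    rerank_score: Callable[[str, str], float],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Two-stage search: cheap embedding retrieval, then pairwise reranking."""
    # Stage 1: score every document against the query in the shared vector space.
    query_vec = embed_query(query)
    retrieved = sorted(
        ((cosine(query_vec, embed_document(doc)), idx) for idx, doc in enumerate(documents)),
        reverse=True,
    )[:top_k]
    # Stage 2: rescore only the top_k candidates with the (slower) pairwise reranker.
    reranked = sorted(
        ((rerank_score(query, documents[idx]), idx) for _, idx in retrieved),
        reverse=True,
    )
    return [(documents[idx], score) for score, idx in reranked]


# Toy stand-ins (assumptions for illustration only, not real model outputs).
TOY_VECTORS = {
    "a photo of a cat": [1.0, 0.0],
    "cat.jpg": [0.9, 0.1],
    "kitten.mp4": [0.8, 0.3],
    "car.wav": [0.0, 1.0],
}
TOY_RERANK = {"cat.jpg": 0.7, "kitten.mp4": 0.95, "car.wav": 0.05}

results = retrieve_and_rerank(
    query="a photo of a cat",
    documents=["cat.jpg", "kitten.mp4", "car.wav"],
    embed_query=TOY_VECTORS.__getitem__,
    embed_document=TOY_VECTORS.__getitem__,
    rerank_score=lambda q, d: TOY_RERANK[d],
    top_k=2,
)
print(results)
```

In real use, `embed_query` and `embed_document` would presumably wrap the embedding model's encode_query and encode_document methods, and `rerank_score` would wrap a CrossEncoder; passing them in as callables keeps the two stages independent of any particular model.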
Table of contents
- What are Multimodal Models?
- Installation
- Multimodal Embedding Models
- Multimodal Reranker Models
- Retrieve and Rerank
- Input Formats and Configuration
- Supported Models
- Additional Resources