Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers

A practical guide to training and finetuning multimodal embedding and reranker models using the Sentence Transformers library. The post walks through finetuning Qwen3-VL-Embedding-2B for Visual Document Retrieval (VDR), covering all training components: model loading, dataset preparation with image-text triplets, CachedMultipleNegativesRankingLoss with gradient caching, MatryoshkaLoss for flexible embedding dimensions, evaluation with InformationRetrievalEvaluator, and the SentenceTransformerTrainer. The finetuned 2B model achieves NDCG@10 of 0.947 vs. the base model's 0.888, outperforming all tested VDR models including ones up to 4x larger. The post also covers multimodal reranker training using CrossEncoderTrainer and discusses Router-based architectures for combining separate modality encoders.
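To make the reported NDCG@10 figures easier to read, here is a minimal sketch of how NDCG@k can be computed for a single query. It assumes binary relevance (a document is either relevant or not) and the common log2 rank discount. The function names `dcg_at_k` and `ndcg_at_k` are illustrative and are not part of Sentence Transformers; the library's InformationRetrievalEvaluator may differ in details such as graded relevance or how scores are averaged over queries.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k entries of a ranked relevance list."""
    # Rank positions are 0-based here, so the discount is log2(position + 2).
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, relevant_doc_ids, k=10):
    """NDCG@k for one query, assuming binary relevance (hypothetical helper)."""
    # Gain is 1.0 for each retrieved document that is relevant, else 0.0.
    gains = [1.0 if doc_id in relevant_doc_ids else 0.0 for doc_id in ranked_doc_ids]
    # The ideal ranking places every relevant document at the top.
    ideal = [1.0] * min(len(relevant_doc_ids), k)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0.0:
        return 0.0  # No relevant documents: define the score as 0.
    return dcg_at_k(gains, k) / ideal_dcg


# One relevant document retrieved at rank 2 instead of rank 1.
score = ndcg_at_k(["d3", "d1", "d7"], {"d1"}, k=10)
print(round(score, 4))
```

A score of 1.0 means every relevant document is ranked at the top, so a corpus-level value such as 0.947 is an average of per-query scores like this one.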

#python

Apr 20 • 18m read time • From huggingface.co

Table of contents

Why Finetune?
Training Components
  Model
  Dataset
  Loss Function
  Training Arguments
  Evaluator
  Trainer
Results
Training Multimodal Reranker Models
Additional Resources
