A practical guide to training and finetuning multimodal embedding and reranker models with the Sentence Transformers library. The post walks through finetuning Qwen3-VL-Embedding-2B for Visual Document Retrieval (VDR) and covers every training component: model loading, dataset preparation with image-text triplets, CachedMultipleNegativesRankingLoss with gradient caching, MatryoshkaLoss for flexible embedding dimensions, evaluation with InformationRetrievalEvaluator, and the SentenceTransformerTrainer. The finetuned 2B model achieves an NDCG@10 of 0.947 versus the base model's 0.888, outperforming all tested VDR models, including ones up to 4x larger. The post also covers multimodal reranker training with CrossEncoderTrainer and discusses Router-based architectures for combining separate modality encoders.
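To make the headline metric concrete, here is a minimal sketch of how NDCG@10 can be computed for a single query, assuming binary relevance (a document is either relevant or not). The helper names `dcg_at_k` and `ndcg_at_k` are illustrative and are not part of Sentence Transformers; the library's own evaluator may differ in details such as tie handling.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the top-k gains (ranks start at 1)."""
    # Rank r (1-based) is discounted by log2(r + 1); enumerate is 0-based.
    return sum(gain / math.log2(idx + 2) for idx, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids, relevant_ids, k=10):
    """NDCG@k for one query with binary relevance labels (an assumption)."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_doc_ids]
    # The ideal ranking places every relevant document at the top.
    ideal_gains = [1.0] * min(len(relevant_ids), k)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical ranking: the single relevant page is retrieved at rank 2.
score = ndcg_at_k(["page_b", "page_a", "page_c"], {"page_a"})
print(round(score, 4))
```

Averaging this per-query value over all evaluation queries gives a dataset-level score of the kind quoted above.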

18m read time · From huggingface.co
Table of contents
Why Finetune?
Training Components
Model
Dataset
Loss Function
Training Arguments
Evaluator
Trainer
Results
Training Multimodal Reranker Models
Additional Resources
