A comprehensive benchmarking study comparing embedding models for hybrid search across multiple dimensions: model quantization (FP32, FP16, INT8), vector precision (float, bfloat16, binary), Matryoshka dimensions, and hardware platforms (Graviton3/4, T4 GPU). Key findings: INT8 quantization provides a 2.7-3.4x CPU speedup with 94-98% quality retention; binary vectors achieve a 32x memory reduction with minimal quality loss on ModernBERT models; and hybrid retrieval consistently outperforms semantic-only search by 3-5 percentage points. The post includes an interactive leaderboard with NanoBEIR evaluation results and practical Vespa configuration examples for implementing these optimizations in production.
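To see where the 32x figure for binary vectors comes from, here is a minimal sketch (an illustration, not code from the post) of sign-based binarization: each float32 dimension costs 4 bytes, while keeping only its sign bit costs 1/8 of a byte, giving 4 / (1/8) = 32x. The function names `binarize` and `hamming_distance` are hypothetical.

```python
import numpy as np

def binarize(embeddings: np.ndarray) -> np.ndarray:
    """Pack float32 embeddings into binary vectors: keep only each
    dimension's sign bit, 8 dimensions per byte."""
    bits = (embeddings > 0).astype(np.uint8)  # sign -> {0, 1}
    return np.packbits(bits, axis=-1)         # (n, d) -> (n, d/8) bytes

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Distance between two packed binary vectors: popcount of XOR."""
    return int(np.unpackbits(a ^ b).sum())

rng = np.random.default_rng(0)
vecs = rng.standard_normal((1000, 768)).astype(np.float32)
packed = binarize(vecs)

# float32: 768 dims * 4 bytes = 3072 bytes; binary: 768 / 8 = 96 bytes
print(vecs[0].nbytes // packed[0].nbytes)  # -> 32
```

Distance between binarized vectors is then a cheap Hamming distance (XOR plus popcount), which is what makes binary vectors attractive for a fast first retrieval phase.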

9 min read · From blog.vespa.ai
Table of contents
- What MTEB doesn’t show you
- Interactive leaderboard
- Getting started with Vespa
- A few caveats
- Wrapping up
