Sparse and dense vectors each have distinct strengths and failure modes in modern search systems. Sparse vectors (BM25, SPLADE, LACONIC) excel at exact-match retrieval, are fully explainable, and need no GPU at query time, but they miss synonyms and paraphrases because they have little semantic understanding. Dense vectors capture meaning across vocabulary gaps, but they are weaker at exact-match precision, consume significant memory at scale (~60GB for 10M docs at 1536 dims), and produce opaque similarity scores. Techniques like HyDE (generating a hypothetical answer and embedding that instead of the raw query) can boost retrieval by ~38% for ambiguous queries. As LLMs increasingly generate search queries in paraphrased, conceptual language, pure keyword search degrades. Research shows that reaching 0.98 recall@1000 requires combining both approaches: hybrid search improves recall by 15-30% over either method alone. Learned sparse models like LACONIC now compete with dense retrievers on benchmarks while using 71% less index memory.
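
The ~60GB figure can be checked with simple arithmetic. The sketch below assumes each dimension is stored as a 4-byte float32 and counts only the raw vectors, with no index overhead; the summary does not state either, and the function name `dense_index_bytes` is made up for illustration.

```python
def dense_index_bytes(num_docs: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone (assumption: float32, no index overhead)."""
    return num_docs * dims * bytes_per_value


raw = dense_index_bytes(10_000_000, 1536)
print(f"{raw / 1e9:.1f} GB")     # 61.4 GB
print(f"{raw / 2**30:.1f} GiB")  # 57.2 GiB
```

Under the same assumptions, storing one byte per dimension (for example, int8 quantization) would cut the raw footprint to a quarter of that.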

10 min read time · From bigdataboutique.com
Table of contents
Sparse Vectors: From BM25 to Learned Representations
Dense Vectors: Semantic Power at a Cost
Where Each Approach Fails
Key Takeaways
