Sparse vs Dense Vectors: How Lexical and Semantic Search Actually Work

Sparse and dense vectors each have distinct strengths and failure modes in modern search systems. Sparse vectors (BM25, SPLADE, LACONIC) excel at exact-match retrieval, offer full explainability, and require no GPU at query time, but fail on synonyms and semantic understanding. Dense vectors capture meaning across vocabulary gaps but struggle with exact precision, cost significant memory at scale (~60GB for 10M docs at 1536 dims), and produce opaque similarity scores. Techniques like HyDE (generating hypothetical answers before embedding) can boost retrieval by ~38% for ambiguous queries. As LLMs increasingly generate search queries using paraphrased, conceptual language, pure keyword search degrades. Research shows that reaching 0.98 recall@1000 requires combining both approaches — hybrid search improves recall 15-30% over either method alone. Learned sparse models like LACONIC now compete with dense retrievers on benchmarks while using 71% less index memory.
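
The ~60GB figure for dense vectors can be checked with simple arithmetic. The sketch below assumes each dimension is stored as a 4-byte float32 and counts only raw vector storage, with no index overhead such as HNSW graph links. The summary does not state either assumption, and the function name is illustrative.

```python
def dense_index_bytes(num_docs: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw storage for one dense vector per document.

    Assumes float32 values (4 bytes each) by default and ignores
    any overhead added by the index structure itself.
    """
    return num_docs * dims * bytes_per_value


raw_bytes = dense_index_bytes(num_docs=10_000_000, dims=1536)
print(f"{raw_bytes / 1e9:.1f} GB")  # 61.4 GB
```

Under the same assumptions, halving bytes_per_value to 2 (for example float16 storage) would halve the estimate, which is one reason quantization is often considered at this scale.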

#elk

#rag

#vector-search

Mar 06 • 10m read time • From bigdataboutique.com

Table of contents

Sparse Vectors: From BM25 to Learned Representations
Dense Vectors: Semantic Power at a Cost
Where Each Approach Fails
Key Takeaways
