Hybrid search combines vector similarity with BM25 keyword scoring, and the BM25 half fails silently when the text analyzer is misconfigured. Weaviate v1.37 introduces observable, per-property tokenization with four general-purpose methods (word, lowercase, whitespace, field), language-specific tokenizers for Chinese, Japanese, and Korean, accent folding for matching across Latin-script languages, and per-property stopword presets that take effect without reindexing. A new /v1/tokenize REST endpoint lets developers inspect exactly which tokens are written to the inverted index and which ones BM25 scores at query time, so it acts as a linter for the analyzer pipeline. Practical use cases include multilingual e-commerce catalogs, RAG over technical documentation, and multi-tenant SaaS with mixed locales.
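To make the difference between the four general-purpose methods concrete, here is a minimal Python sketch that approximates their documented behavior: `word` splits on non-alphanumeric characters and lowercases, `lowercase` splits on whitespace and lowercases, `whitespace` splits on whitespace and preserves case, and `field` keeps the whole trimmed value as one token. The function names `tokenize` and `fold_accents` are hypothetical helpers written for illustration; this is not Weaviate's implementation, and edge cases may differ from what the real analyzer produces.

```python
import re
import unicodedata


def fold_accents(text: str) -> str:
    """Approximate accent folding: decompose characters, drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.combining(c) == 0)


def tokenize(text: str, method: str) -> list[str]:
    """Illustrative stand-in for the four general-purpose tokenization methods."""
    if method == "word":
        # Keep runs of letters and digits only (underscore excluded), lowercased.
        return [t.lower() for t in re.findall(r"[^\W_]+", text)]
    if method == "lowercase":
        # Split on whitespace only; punctuation stays attached to the token.
        return text.lower().split()
    if method == "whitespace":
        # Split on whitespace only; case is preserved.
        return text.split()
    if method == "field":
        # The whole trimmed value becomes a single token.
        return [text.strip()]
    raise ValueError(f"unknown tokenization method: {method}")


sample = "Hello, (beautiful) World"
for method in ("word", "lowercase", "whitespace", "field"):
    print(method, tokenize(sample, method))
# word ['hello', 'beautiful', 'world']
# lowercase ['hello,', '(beautiful)', 'world']
# whitespace ['Hello,', '(beautiful)', 'World']
# field ['Hello, (beautiful) World']

print(fold_accents("Crème Brûlée"))  # Creme Brulee
```

The sketch shows why a mismatch matters: a query for `world` matches the `word` and `lowercase` tokens above, but not the case-preserving `whitespace` token `World` or the single `field` token.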

11 min read. From weaviate.io
Table of contents
Hybrid search 101
Tokenization methods
Accent folding for multilingual search
Per-property stopwords
The tokenize endpoint
Use cases
Summary
Ready to start building?
Don't want to miss another blog post?
