Hybrid search depends equally on vector similarity and BM25 keyword scoring, but the BM25 half silently breaks when the text analyzer is misconfigured. Weaviate v1.37 introduces observable, per-property tokenization with four general-purpose methods (word, lowercase, whitespace, field), language-specific tokenizers for CJK/Korean, accent folding for multilingual Latin-script matching, and per-property stopword presets that take effect without reindexing. A new /v1/tokenize REST endpoint lets developers inspect exactly what tokens are written to the inverted index versus what BM25 scores at query time, acting as a linter for the analyzer pipeline. Practical use cases include multilingual e-commerce catalogs, technical documentation RAG, and multi-tenant SaaS with mixed locales.
Table of contents
Hybrid search 101
Tokenization methods
Accent folding for multilingual search
Per-property stopwords
The tokenize endpoint
Use cases
Summary
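To make the four general-purpose tokenization methods and accent folding from the summary concrete, here is a minimal Python sketch that approximates their documented behavior. It is an illustration only, not Weaviate's implementation: the function names `tokenize` and `fold_accents` are hypothetical, the `word` rule is assumed to split on any non-alphanumeric character and lowercase the result, and accent folding is approximated with Unicode NFKD decomposition. The real analyzer may differ in edge cases.

```python
import re
import unicodedata


def tokenize(text: str, method: str) -> list[str]:
    """Approximate the four general-purpose tokenization methods (illustrative only)."""
    if method == "word":
        # Assumed rule: keep runs of alphanumeric characters, then lowercase them.
        return [t.lower() for t in re.findall(r"[^\W_]+", text)]
    if method == "lowercase":
        # Lowercase the text, then split on whitespace only.
        return text.lower().split()
    if method == "whitespace":
        # Split on whitespace only; case is preserved.
        return text.split()
    if method == "field":
        # Treat the whole trimmed value as a single token.
        stripped = text.strip()
        return [stripped] if stripped else []
    raise ValueError(f"unknown tokenization method: {method}")


def fold_accents(token: str) -> str:
    """Strip combining marks so that 'crème' and 'creme' compare equal (an approximation)."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    sample = "Crème-Brûlée v2_beta"
    for method in ("word", "lowercase", "whitespace", "field"):
        print(method, tokenize(sample, method))
    print("folded", [fold_accents(t) for t in tokenize(sample, "word")])
```

Running the same string through each method shows why a mismatch matters for BM25: a query for `creme` can only match a property tokenized with `word` if accent folding is applied on both the indexing and the query side, and it can never match a `field` property unless the whole value is identical.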