Text Analysis for Hybrid Search: Tokenization, Stopwords & Accent Folding

Hybrid search combines vector similarity with BM25 keyword scoring, and the BM25 half breaks silently when the text analyzer is misconfigured. Weaviate v1.37 introduces observable, per-property tokenization with four general-purpose methods (word, lowercase, whitespace, field), language-specific tokenizers for Chinese, Japanese, and Korean text, accent folding for multilingual Latin-script matching, and per-property stopword presets that take effect without reindexing. A new /v1/tokenize REST endpoint lets developers inspect exactly which tokens are written to the inverted index and which of them BM25 scores at query time, so it works as a linter for the analyzer pipeline. Practical use cases include multilingual e-commerce catalogs, technical documentation RAG, and multi-tenant SaaS with mixed locales.
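To make the four general-purpose methods, accent folding, and query-time stopwords concrete, here is a minimal pure-Python sketch. It approximates the documented behavior of each method and is not Weaviate's implementation; the function names (`tokenize`, `fold_accents`, `drop_stopwords`) and the tiny stopword set are hypothetical and chosen only for illustration. It does not call the /v1/tokenize endpoint, whose request format is not described here.

```python
import unicodedata


def tokenize(text: str, method: str) -> list[str]:
    """Approximate the four general-purpose tokenization methods."""
    if method == "word":
        # Split on anything that is not a letter or digit, then lowercase.
        cleaned = "".join(c if c.isalnum() else " " for c in text)
        return cleaned.lower().split()
    if method == "lowercase":
        # Split on whitespace only, then lowercase; punctuation is kept.
        return text.lower().split()
    if method == "whitespace":
        # Split on whitespace only; case and punctuation are kept.
        return text.split()
    if method == "field":
        # Treat the whole trimmed value as a single token.
        return [text.strip()]
    raise ValueError(f"unknown tokenization method: {method}")


def fold_accents(text: str) -> str:
    """Strip combining marks so 'é' matches 'e' (an illustrative approach)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# Hypothetical stopword set; real presets are much larger.
STOPWORDS = {"the", "a", "an", "of"}


def drop_stopwords(tokens: list[str], stopwords: set[str] = STOPWORDS) -> list[str]:
    """Filter tokens at query time; the index itself is left unchanged."""
    return [t for t in tokens if t not in stopwords]


if __name__ == "__main__":
    sample = "The Wi-Fi Café"
    for method in ("word", "lowercase", "whitespace", "field"):
        print(method, tokenize(sample, method))
    indexed = tokenize(fold_accents(sample), "word")
    print("indexed:", indexed)
    print("scored: ", drop_stopwords(indexed))
```

Because the stopword filter runs over already-tokenized text rather than rewriting the stored tokens, swapping the stopword set in this sketch changes what is scored without touching what was indexed, which mirrors the "no reindexing" property described above.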

#nlp

#weaviate

May 14 • 11m read time • From weaviate.io

Table of contents

Hybrid search 101
Tokenization methods
Accent folding for multilingual search
Per-property stopwords
The tokenize endpoint
Use cases
Summary
