A practical guide to production RAG retrieval pipelines in 2026, covering how to choose embedding models based on training data and query shape rather than benchmark scores alone. Key patterns include: understanding what embeddings encode (and fail to encode), a three-tier model selection framework (frontier APIs, open-weights, specialized), Matryoshka embeddings for dimension flexibility, hybrid search combining dense and BM25 as a non-optional default, and cross-encoder rerankers for precision. The post details cost/latency budgets at each pipeline stage, tuning methodology (one variable at a time against a fixed eval set), and multilingual/multimodal considerations. The recommended production stack is open-weights bi-encoder + hybrid search with reciprocal rank fusion + small cross-encoder reranker on top-50 candidates, with evals built from real user queries.
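To make the recommended stack concrete, here is a minimal sketch of reciprocal rank fusion (RRF), the step that merges the BM25 and dense candidate lists before reranking. The function and variable names are illustrative, not from the post; RRF itself scores each document as the sum of 1/(k + rank) across the input rankings, with k = 60 as the conventional default.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked lists of doc IDs with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) across the lists it
    appears in; k=60 is the commonly used constant.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; the top slice goes to the reranker.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical usage: fuse BM25 and dense results, then send the
# top 50 candidates to a small cross-encoder reranker.
bm25_ids = ["d3", "d1", "d7"]
dense_ids = ["d1", "d9", "d3"]
candidates = rrf_fuse([bm25_ids, dense_ids])[:50]
```

Because RRF operates on ranks rather than raw scores, it sidesteps the problem that BM25 and cosine-similarity scores live on incomparable scales.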
Table of contents
What An Embedding Model Actually Encodes
The Embedding Model Choice In Three Tiers
Embedding Dimension And The Cost Curve
Hybrid Search Is Not Optional
What A Reranker Does, And Why You Probably Need One
Picking A Reranker
Cost And Latency Budgets
Multilingual, Multimodal, And The Rest Of The Long Tail
How To Tune The Pipeline Without Breaking It
What I Would Build From Scratch