This post traces the history of and recent advances in text embedding models, covering the tricks used to build state-of-the-art models and the role of contrastive loss in fine-tuning. It also highlights the importance of standardized evaluation and walks through the training recipe behind today's top-performing models.
Table of contents
How to Build a State-of-the-art Text Embedding Model
A (very not comprehensive) history of embeddings
Language-model-based embeddings have taken over information retrieval
The modern state-of-the-art recipe for building embeddings
Trick 1: Start with a pre-trained general-purpose language model
Trick 2: Fine-tune for information retrieval with contrastive loss
Trick 3: Prefix your queries
Trick 4: Scale to large batch sizes to optimally leverage in-batch negatives
Trick 5: Finish training with some hard negatives
Why does this work?
Conclusion
Define the future of AI with us
Acknowledgements
References
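As a preview of Tricks 2 and 4 above, here is a minimal sketch of contrastive fine-tuning with in-batch negatives, using an InfoNCE-style loss. The function name, tensor shapes, and temperature value are illustrative assumptions, not the post's actual training code.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, doc_emb, temperature=0.05):
    """InfoNCE-style loss: each query's positive is its paired document;
    every other document in the batch serves as an in-batch negative."""
    # Normalize so the dot product equals cosine similarity.
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    # (batch, batch) similarity matrix: query i scored against every document.
    logits = query_emb @ doc_emb.T / temperature
    # The matching document for query i sits on the diagonal.
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random tensors standing in for encoder outputs.
queries = torch.randn(8, 768)  # batch of 8 query embeddings
docs = torch.randn(8, 768)     # their paired positive documents
print(in_batch_contrastive_loss(queries, docs).item())
```

Note how the batch size directly sets the number of negatives per query, which is why Trick 4 scales batches up: larger batches make the classification problem over the `logits` rows harder and the resulting embeddings sharper.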