InfoQ

DoorDash has built DashCLIP, a multimodal ML system that uses contrastive learning to align product images, text descriptions, and user search queries in a shared embedding space. The system was trained on ~32 million labeled query-product pairs (700K human-annotated, expanded via GPT-based labeling) and uses separate encoders for images, text, and queries. A two-stage pipeline first adapts pretrained vision-language models to the e-commerce domain, then aligns query and product embeddings using a Query Catalog Contrastive (QCC) loss. In production, query embeddings drive K-nearest neighbor retrieval, and the retrieved candidates feed downstream ranking models. DashCLIP outperformed baselines such as CLIP, BLIP, and FLAVA in offline evaluations and improved engagement metrics in A/B tests. The embeddings also generalize to other tasks, including aisle category prediction and product relevance classification.
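To make the two core ideas concrete, here is a minimal sketch of contrastive alignment and embedding-based retrieval. The summary does not give the QCC loss formula, so the code uses a generic symmetric InfoNCE-style loss as a stand-in. The function names, the temperature of 0.07, and the toy three-dimensional vectors are all illustrative assumptions, not DoorDash's implementation. Retrieval is done by brute-force cosine similarity, whereas a production system would more likely use an approximate nearest-neighbor index.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def contrastive_loss(query_embs, product_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of matched pairs.

    Row i of query_embs is assumed to match row i of product_embs; every
    other product in the batch acts as a negative. This is a generic
    stand-in, NOT the QCC loss, whose exact form the summary does not state.
    """
    q = [l2_normalize(v) for v in query_embs]
    p = [l2_normalize(v) for v in product_embs]
    n = len(q)
    # Pairwise cosine similarities, sharpened by the temperature.
    logits = [[dot(q[i], p[j]) / temperature for j in range(n)] for i in range(n)]

    def cross_entropy_diag(rows):
        # Cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum_exp - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    # Average the query-to-product and product-to-query directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))

def knn_retrieve(query_emb, catalog, k=2):
    """Brute-force cosine KNN: return the k product ids closest to the query."""
    q = l2_normalize(query_emb)
    scored = [(dot(q, l2_normalize(emb)), pid) for pid, emb in catalog.items()]
    scored.sort(reverse=True)
    return [pid for _, pid in scored[:k]]

# Hypothetical toy catalog: product id -> embedding.
catalog = {
    "oat_milk": [0.9, 0.1, 0.0],
    "almond_milk": [0.8, 0.2, 0.1],
    "dish_soap": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]  # hypothetical embedding of a "milk"-like query
print(knn_retrieve(query, catalog, k=2))
```

The loss falls as each query moves closer to its matched product and further from the other products in the batch. That training signal is what makes nearest-neighbor lookup in the shared space meaningful at serving time.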

DoorDash Builds DashCLIP to Align Images, Text, and Queries for Semantic Search Using 32M Labels