DoorDash has built DashCLIP, a multimodal ML system that aligns product images, text descriptions, and user search queries in a shared embedding space using contrastive learning. Trained on ~32 million labeled query-product pairs (700K human-annotated, expanded via GPT-based labeling), the system uses separate encoders for images, text, and queries. A two-stage pipeline first adapts pretrained vision-language models to the e-commerce domain, then aligns query and product embeddings using Query Catalog Contrastive (QCC) loss. In production, query embeddings power K-nearest neighbor retrieval feeding downstream ranking models. DashCLIP outperformed baseline models like CLIP, BLIP, and FLAVA in offline evaluations and improved engagement metrics in A/B tests. The embeddings also generalize to tasks like aisle category prediction and product relevance classification.
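
To make the alignment and retrieval steps concrete, here is a minimal sketch in plain Python. It assumes a generic in-batch InfoNCE-style contrastive objective, which is not necessarily the exact form of the QCC loss, and a brute-force cosine-similarity search standing in for the production K-nearest neighbor index. The function names (`contrastive_loss`, `knn`), the temperature value, and the toy 3-dimensional embeddings are all hypothetical and chosen only for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(query_embs, product_embs, temperature=0.07):
    """In-batch InfoNCE: query i should score highest against product i.

    Generic contrastive objective (an assumption), not DoorDash's exact QCC loss.
    """
    queries = [normalize(q) for q in query_embs]
    products = [normalize(p) for p in product_embs]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, p) / temperature for p in products]
        # Numerically stable log-sum-exp over every product in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching product as the positive class.
        total += log_denominator - logits[i]
    return total / len(queries)


def knn(query_emb, catalog, k=2):
    """Return the k product ids with the highest cosine similarity to the query."""
    q = normalize(query_emb)
    scored = [(dot(q, normalize(emb)), pid) for pid, emb in catalog.items()]
    scored.sort(reverse=True)
    return [pid for _, pid in scored[:k]]


# Toy embeddings, made up for illustration only.
catalog = {
    "cola": [0.9, 0.1, 0.0],
    "lemonade": [0.7, 0.3, 0.1],
    "shampoo": [0.0, 0.1, 0.9],
}
print(knn([1.0, 0.2, 0.0], catalog, k=2))  # prints ['cola', 'lemonade']
```

The loss shrinks as each query embedding moves closer to its paired product and away from the other products in the batch, which is what lets a simple nearest-neighbor lookup return relevant candidates for a downstream ranker. At catalog scale, the brute-force scan would typically be replaced by an approximate nearest neighbor index.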

3 min read · From infoq.com