Instacart rebuilt its query understanding system using LLMs to better handle long-tail searches and ambiguous queries. The team progressed from context engineering with RAG to fine-tuning smaller models such as Llama-3-8B, consolidating multiple specialized models into a unified system. They implemented a hybrid architecture: an offline pipeline generates high-quality training data and caches results for common queries, while a fine-tuned real-time model handles rare searches. Through adapter merging, GPU optimization, and quantization experiments, they reduced latency from 700 ms to 300 ms while improving search quality metrics by 6% on tail queries.
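One of the latency levers named above is adapter merging. As a minimal illustrative sketch (not Instacart's actual code), the snippet below merges a fine-tuned LoRA adapter back into a Llama-3-8B base model using Hugging Face's PEFT library; the adapter path and output directory are hypothetical placeholders.

```python
# Sketch: fold a fine-tuned LoRA adapter into the base model so inference
# runs on plain weights, with no separate adapter matmul on the hot path.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "meta-llama/Meta-Llama-3-8B"   # base model named in the post
ADAPTER_PATH = "qu-llama3-lora-adapter"     # hypothetical fine-tuned adapter

base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype="auto")
model = PeftModel.from_pretrained(base, ADAPTER_PATH)

# merge_and_unload() applies the low-rank update (W <- W + B @ A) to each
# wrapped layer and returns a plain transformers model with no PEFT wrappers.
merged = model.merge_and_unload()
merged.save_pretrained("qu-llama3-8b-merged")
AutoTokenizer.from_pretrained(BASE_MODEL).save_pretrained("qu-llama3-8b-merged")
```

The merged checkpoint can then be quantized and served like any stock model, which fits the serving optimizations (quantization, GPU tuning) the summary mentions.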

13 min read · from tech.instacart.com
Table of contents
- Introduction
- Challenges in Traditional Query Understanding
- The Advantages of LLMs
- LLM as QU: Our Strategy in Action
  1. Query Category Classification
  2. Query Rewrites
  3. Semantic Role Labeling (SRL)
- Building a New Foundation: Fine-Tuning for Real-Time Inference
- Distilling Knowledge via Fine-Tuning
- The Path to Production: Taming Real-Time Latency
- Key Takeaways
