Zalando built a search quality assurance framework that uses LLM-as-a-judge to evaluate search relevance at scale, enabling pre-launch validation for new markets (Luxembourg, Portugal, Greece). The system clusters search queries by their NER tags to group similar search intents, translates them into the target languages via an LLM, then evaluates search results with GPT-4o, which scores relevance on a 0–4 scale. The pipelines run on Apache Airflow using KubernetesPodOperator tasks, with an ElastiCache layer to avoid redundant product-data fetches. A full evaluation of 1,500 search segments with 25 results each costs roughly $250 and takes 3–5 hours. The framework proactively surfaced NER lemmatization bugs, unrecognized Portuguese and Greek terms, and undiscoverable product categories before go-live, replacing a slow, reactive manual review process.
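The judging step described above can be sketched roughly as follows. This is a minimal illustration of LLM-as-a-judge relevance scoring on a 0–4 scale, not Zalando's actual implementation; the prompt wording and the names `JUDGE_PROMPT`, `score_results`, and `segment_score` are assumptions, and the judge is injected as a callable so a real GPT-4o client (or a stub) can be plugged in.

```python
from statistics import mean
from typing import Callable

# Hypothetical prompt for the 0-4 relevance scale described in the article.
JUDGE_PROMPT = (
    "You are a search relevance judge. Given a shopper query and a product, "
    "reply with a single integer from 0 (irrelevant) to 4 (perfectly relevant).\n\n"
    "Query: {query}\nProduct: {product}\nScore:"
)

def score_results(query: str, products: list[str],
                  judge: Callable[[str], str]) -> list[int]:
    """Ask the judge (e.g. a GPT-4o completion call) for a 0-4 score per result."""
    scores = []
    for product in products:
        raw = judge(JUDGE_PROMPT.format(query=query, product=product))
        score = int(raw.strip())
        scores.append(min(max(score, 0), 4))  # clamp to the 0-4 scale
    return scores

def segment_score(scores: list[int]) -> float:
    """Aggregate per-result scores into one number for a search segment."""
    return mean(scores) / 4  # normalise to [0, 1]
```

At the stated scale, 1,500 segments × 25 results means 37,500 judge calls per run, so the ~$250 figure works out to well under a cent per judged result.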
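The cache layer that avoids redundant product-data fetches is, in essence, read-through caching. A minimal sketch, with a plain dict standing in for a real ElastiCache/Redis client and the hypothetical names `cached_fetch` and `fetch`:

```python
from typing import Callable

def cached_fetch(product_id: str, cache: dict,
                 fetch: Callable[[str], dict]) -> dict:
    """Return product data from the cache, calling fetch() only on a miss.

    In the pipeline described above, `cache` would be an ElastiCache client
    and `fetch` a catalog-API call; both are stubbed here for illustration.
    """
    if product_id not in cache:
        cache[product_id] = fetch(product_id)  # fetched once, reused thereafter
    return cache[product_id]
```

Because many evaluated queries return overlapping products, each product is fetched at most once per run instead of once per appearance.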

14m read time · From engineering.zalando.com
Table of contents
- Real-world use case: Launching a new country
- Data-Driven Approach with LLM-as-a-judge
- Selection of Test Queries
- How Does Search Quality Evaluation Work?
- Production time: The Evaluation Pipelines
- Results
- Cost of evaluation
- Bottom Line
