Zalando built a search quality assurance framework using LLM-as-a-judge to evaluate search relevance at scale, enabling pre-launch validation for new markets (Luxembourg, Portugal, Greece). The system clusters search queries by NER tags to group similar search intents, translates them to target languages via LLM, then evaluates
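
As a rough illustration of the clustering step only, queries that carry the same set of NER tag types can be grouped under one key. This is a minimal sketch, not Zalando's implementation: the function name `cluster_by_ner_tags`, the tag labels, and the sample queries are all hypothetical, and the tags are assumed to come from an upstream NER model.

```python
from collections import defaultdict


def cluster_by_ner_tags(tagged_queries):
    """Group queries by the set of NER tag types they contain.

    tagged_queries: iterable of (query, tags) pairs, where tags is a list
    of tag-type strings. The tag labels used here are hypothetical.
    """
    clusters = defaultdict(list)
    for query, tags in tagged_queries:
        # Order and repetition of tags are ignored, so the key is the
        # sorted set of tag types.
        signature = tuple(sorted(set(tags)))
        clusters[signature].append(query)
    return dict(clusters)


# Hypothetical, pre-tagged sample queries.
tagged = [
    ("red nike sneakers", ["COLOR", "BRAND", "CATEGORY"]),
    ("black adidas hoodie", ["COLOR", "BRAND", "CATEGORY"]),
    ("summer dress", ["SEASON", "CATEGORY"]),
]

clusters = cluster_by_ner_tags(tagged)
for signature, queries in clusters.items():
    print(signature, queries)
```

Keying on the sorted set of tag types means two queries land in the same cluster whenever they express the same kind of intent (for example colour plus brand plus category), whatever the word order.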
Table of contents
Real-world use case: Launching a new country
Data-Driven Approach with LLM-as-a-judge
Selection of Test Queries
How Does Search Quality Evaluation Work?
Production time: The Evaluation Pipelines
Results
Cost of evaluation
Bottom Line