Zalando built a search quality assurance framework that uses LLM-as-a-judge to evaluate search relevance at scale, enabling pre-launch validation for new markets (Luxembourg, Portugal, Greece). The system clusters search queries by their NER tags to group similar search intents, translates them into the target languages via an LLM, then evaluates search results with GPT-4o, which scores relevance on a 0–4 scale. The pipelines run on Apache Airflow using KubernetesPodOperator tasks, with an ElastiCache layer to avoid redundant product-data fetches. A full evaluation of 1,500 search segments with 25 results each costs roughly $250 and takes 3–5 hours. The framework proactively surfaced NER lemmatization bugs, unrecognized Portuguese and Greek terms, and undiscoverable product categories before go-live, replacing a slow, reactive manual review process.
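The judging step described above can be sketched roughly as follows. This is a minimal illustration of LLM-as-a-judge relevance scoring on a 0–4 scale, not Zalando's actual implementation; the prompt wording and the names `JUDGE_PROMPT`, `score_results`, and `segment_score` are assumptions, and the judge is injected as a callable so a real GPT-4o client (or a stub) can be plugged in.

```python
from statistics import mean
from typing import Callable

# Hypothetical prompt for the 0-4 relevance scale described in the article.
JUDGE_PROMPT = (
    "You are a search relevance judge. Given a shopper query and a product, "
    "reply with a single integer from 0 (irrelevant) to 4 (perfectly relevant).\n\n"
    "Query: {query}\nProduct: {product}\nScore:"
)

def score_results(query: str, products: list[str],
                  judge: Callable[[str], str]) -> list[int]:
    """Ask the judge (e.g. a GPT-4o completion call) for a 0-4 score per result."""
    scores = []
    for product in products:
        raw = judge(JUDGE_PROMPT.format(query=query, product=product))
        score = int(raw.strip())
        scores.append(min(max(score, 0), 4))  # clamp to the 0-4 scale
    return scores

def segment_score(scores: list[int]) -> float:
    """Aggregate per-result scores into one number for a search segment."""
    return mean(scores) / 4  # normalise to [0, 1]
```

At the stated scale, 1,500 segments × 25 results means 37,500 judge calls per run, so the ~$250 figure works out to well under a cent per judged result.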
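The cache layer that avoids redundant product-data fetches is, in essence, read-through caching. A minimal sketch, with a plain dict standing in for a real ElastiCache/Redis client and the hypothetical names `cached_fetch` and `fetch`:

```python
from typing import Callable

def cached_fetch(product_id: str, cache: dict,
                 fetch: Callable[[str], dict]) -> dict:
    """Return product data from the cache, calling fetch() only on a miss.

    In the pipeline described above, `cache` would be an ElastiCache client
    and `fetch` a catalog-API call; both are stubbed here for illustration.
    """
    if product_id not in cache:
        cache[product_id] = fetch(product_id)  # fetched once, reused thereafter
    return cache[product_id]
```

Because many evaluated queries return overlapping products, each product is fetched at most once per run instead of once per appearance.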

14m read time · From engineering.zalando.com
Table of contents
- Real-world use case: Launching a new country
- Data-Driven Approach with LLM-as-a-judge
- Selection of Test Queries
- How Does Search Quality Evaluation Work?
- Production time: The Evaluation Pipelines
- Results
- Cost of evaluation
- Bottom Line
