LLM-as-a-Judge: Evaluating natural language search

DoorDash built a natural language search (NLS) system that handles vague, multi-constraint queries like 'cozy date night dinner.' Evaluating such a system proved harder than building it: traditional metrics were noisy, and human annotation was slow and inconsistent. An audit of 6,824 query-store judgments revealed that human labels were wrong in over half of reviewed cases, with disagreement concentrated at boundary relevance levels. The root cause was a rubric that forced annotators to improvise on subjective, multi-faceted queries. The solution was to decompose relevance into independent binary facets (e.g., 'Does this store serve tacos?' 'Does it offer items under $12?'), calibrate an LLM judge against adjudicated human consensus, and automate execution through daily monitoring and PR-level guardrails. Evaluating NDCG per facet exposed failures in speed and geolocation that aggregate scores had masked. Key lessons include starting with binary relevance, treating the rubric as a versioned product artifact, and investing early in context completeness (item-level pricing, customization flags, display logic).
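To make the per-facet idea concrete, here is a minimal sketch of computing NDCG separately for each binary facet over one ranked result list. The summary does not give the formula or data model the team used, so everything below is an assumption: the facet names (`serves_tacos`, `items_under_12`) are hypothetical labels derived from the example questions, the log2-discounted form of DCG is the common textbook one, and the ideal ranking is taken from the returned list only.

```python
import math
from typing import Dict, List


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain with the common log2 rank discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(gains: List[float]) -> float:
    """DCG normalized by the best ordering of the same gains (0.0 if no gain)."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def per_facet_ndcg(ranked: List[Dict[str, int]]) -> Dict[str, float]:
    """Score each binary facet on its own, in ranked order.

    `ranked` holds one dict per result, mapping a facet name to a 0/1 judgment.
    A facet missing from a result is treated as 0 (an assumption).
    """
    facets = sorted({name for labels in ranked for name in labels})
    return {
        name: ndcg([float(labels.get(name, 0)) for labels in ranked])
        for name in facets
    }


# Hypothetical judgments for one query, in the order the stores were ranked.
ranked_stores = [
    {"serves_tacos": 1, "items_under_12": 0},
    {"serves_tacos": 1, "items_under_12": 0},
    {"serves_tacos": 0, "items_under_12": 1},
]

scores = per_facet_ndcg(ranked_stores)
for facet, score in scores.items():
    print(f"{facet}: {score:.2f}")
```

In this toy list the cuisine facet is ranked perfectly while the price facet is satisfied only by the last result, a gap that a single blended relevance score could hide.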

May 14 • 19m read time • From careersatdoordash.com

Table of contents

Motivation: When manual labeling becomes the bottleneck
Diagnosis: Label noise and rubric–intent mismatch
Root cause: Random error and rubric–intent mismatch
Identifying binary facets key to design
Architecture in three phases
When the evaluation broke
Practical advice for adopting LLM-as-a-judge
What we would do differently
Looking forward
Acknowledgements
References
