DoorDash built a natural language search (NLS) system that handles vague, multi-constraint queries like 'cozy date night dinner.' Evaluating such a system proved harder than building it — traditional metrics were noisy and human annotation was slow and inconsistent. An audit of 6,824 query-store judgments revealed that human labels were wrong in over half of reviewed cases, with disagreement concentrated at boundary relevance levels. The root cause was a rubric that forced annotators to improvise on subjective, multi-faceted queries. The solution was to decompose relevance into independent binary facets (e.g., 'Does this store serve tacos?' 'Does it offer items under $12?'), calibrate an LLM judge against adjudicated human consensus, and automate execution via daily monitoring and PR-level guardrails. Per-facet NDCG evaluation exposed hidden failures in speed and geolocation that aggregate scores masked. Key lessons include starting with binary relevance, treating the rubric as a versioned product artifact, and investing early in context completeness (item-level pricing, customization flags, display logic).
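To make the per-facet idea concrete, here is a minimal sketch of binary-gain NDCG computed separately for each facet of one ranked result list. It is not DoorDash's implementation: the facet names (`serves_tacos`, `under_12_dollars`), the function names, and the choice to build the ideal ranking from the retrieved list only are all illustrative assumptions.

```python
import math
from typing import Dict, List


def dcg(labels: List[int]) -> float:
    """Discounted cumulative gain for binary labels in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(labels))


def ndcg(labels: List[int]) -> float:
    """NDCG against the best reordering of the same list (assumption:
    the ideal ranking is built only from the retrieved results)."""
    ideal = dcg(sorted(labels, reverse=True))
    return dcg(labels) / ideal if ideal > 0 else 0.0


def per_facet_ndcg(ranked_judgments: List[Dict[str, int]]) -> Dict[str, float]:
    """Compute one NDCG score per facet.

    ranked_judgments holds one dict per result, in ranked order, mapping a
    hypothetical facet name to a binary judgment (1 = satisfied, 0 = not).
    """
    if not ranked_judgments:
        return {}
    facets = ranked_judgments[0].keys()
    return {f: ndcg([j[f] for j in ranked_judgments]) for f in facets}


# Hypothetical judgments for a query like "tacos under $12".
ranked = [
    {"serves_tacos": 1, "under_12_dollars": 0},
    {"serves_tacos": 1, "under_12_dollars": 0},
    {"serves_tacos": 0, "under_12_dollars": 1},
]

scores = per_facet_ndcg(ranked)
aggregate = sum(scores.values()) / len(scores)
print(scores, aggregate)
```

In this toy list the cuisine facet is ranked perfectly while the price facet is not, yet the averaged score sits between the two. That is the kind of masking the summary attributes to aggregate scores.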

19m read time. From careersatdoordash.com
Table of contents
Motivation: When manual labeling becomes the bottleneck
Diagnosis: Label noise and rubric–intent mismatch
Root cause: Random error and rubric–intent mismatch
Identifying binary facets key to design
Architecture in three phases
When the evaluation broke
Practical advice for adopting LLM-as-a-judge
What we would do differently
Looking forward
Acknowledgements
References
