A comparison of two approaches to evaluating search quality: query-based evaluation, which aggregates clicks per query into relevance labels, and session-based evaluation, which replays individual user sessions. Session-based evaluation samples more accurately by weighting each user interaction equally, much like probability-based polling, and it preserves time-sensitive features such as dynamic pricing for learning-to-rank training. However, it sacrifices per-query debuggability. The post recommends using both: session-based evaluation for simulated A/B testing and query-based evaluation for diagnosing specific query failures.
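The contrast between the two approaches can be sketched in a few lines. This is a minimal illustration, not the post's implementation: the log schema, field names, and the per-session metric (reciprocal rank of the first click) are all hypothetical assumptions.

```python
from collections import defaultdict

# Hypothetical click log: one record per impression within a session.
logs = [
    {"session": "s1", "query": "usb cable", "doc": "d1", "position": 1, "clicked": True},
    {"session": "s1", "query": "usb cable", "doc": "d2", "position": 2, "clicked": False},
    {"session": "s2", "query": "usb cable", "doc": "d2", "position": 1, "clicked": True},
    {"session": "s3", "query": "hdmi cable", "doc": "d3", "position": 1, "clicked": False},
]

def query_based_labels(logs):
    """Query-based eval: aggregate clicks per (query, doc) into a CTR-style
    relevance label. Easy to inspect per query, but popular queries dominate
    and time-sensitive context (e.g. the price shown) is averaged away."""
    clicks, views = defaultdict(int), defaultdict(int)
    for r in logs:
        key = (r["query"], r["doc"])
        views[key] += 1
        clicks[key] += int(r["clicked"])
    return {k: clicks[k] / views[k] for k in views}

def session_based_eval(logs, score_fn):
    """Session-based eval: replay each session independently and average a
    per-session metric, so every user interaction carries equal weight."""
    sessions = defaultdict(list)
    for r in logs:
        sessions[r["session"]].append(r)
    scores = [score_fn(impressions) for impressions in sessions.values()]
    return sum(scores) / len(scores)

def session_reciprocal_rank(impressions):
    """Example per-session metric: reciprocal rank of the first clicked result."""
    for r in sorted(impressions, key=lambda r: r["position"]):
        if r["clicked"]:
            return 1.0 / r["position"]
    return 0.0
```

Note the design difference: the query-based path produces labels keyed by `(query, doc)`, useful for drilling into a single failing query, while the session-based path produces one scalar per session, which averages into an A/B-style metric.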