This post presents a framework for validating LLM-as-a-judge systems when a rating task can have more than one correct answer (rating indeterminacy). Instead of forcing each rater to pick a single option, the approach elicits response sets (every option the rater considers acceptable), aggregates rater disagreement into multi-label vectors, and measures human-judge agreement with continuous metrics such as MSE. Experiments across nine commercial LLMs and eleven rating tasks show that traditional forced-choice agreement metrics select suboptimal judge systems, while the proposed multi-label approach correctly identifies the judges that perform best on downstream tasks such as content filtering and prevalence estimation.
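To make the aggregation and agreement steps concrete, here is a minimal Python sketch. It is not the authors' implementation: the summary does not spell out the exact aggregation rule, so the sketch assumes each entry of the multi-label vector is the fraction of raters whose response set includes that option. The function names `multilabel_vector` and `mse`, the option labels, and the toy ratings are all hypothetical.

```python
from typing import List, Set

# Hypothetical rating options for a single item (not taken from the post).
OPTIONS = ["yes", "no"]


def multilabel_vector(response_sets: List[Set[str]], options: List[str]) -> List[float]:
    """Aggregate response sets into a multi-label vector.

    Assumption: entry k is the fraction of raters whose response set
    contains option k. Entries need not sum to 1, because a rater may
    accept several options.
    """
    n = len(response_sets)
    return [sum(opt in rs for rs in response_sets) / n for opt in options]


def mse(human: List[float], judge: List[float]) -> float:
    """Mean squared error between two multi-label vectors of equal length."""
    return sum((h - j) ** 2 for h, j in zip(human, judge)) / len(human)


# Toy data: four human raters and four sampled judge outputs for one item.
human_sets = [{"yes"}, {"yes", "no"}, {"no"}, {"yes"}]
judge_sets = [{"yes"}, {"yes"}, {"yes", "no"}, {"yes"}]

human_vec = multilabel_vector(human_sets, OPTIONS)  # [0.75, 0.5]
judge_vec = multilabel_vector(judge_sets, OPTIONS)  # [1.0, 0.25]
print(mse(human_vec, judge_vec))                    # 0.0625
```

A forced-choice setup would make the second human rater choose between "yes" and "no"; the response-set version records that both were considered acceptable, which is the information the multi-label vector preserves.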

14m read time. From blog.ml.cmu.edu
Table of contents
A Framework for Meta-Evaluation under Rating Indeterminacy
Empirical Validation
Practical Takeaways
