A framework for validating LLM-as-a-judge systems when rating tasks have multiple correct answers (rating indeterminacy). The approach uses response set elicitation instead of forced-choice ratings, aggregates disagreement into multi-label vectors, and measures human-judge agreement with continuous metrics like MSE. Experiments
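
The aggregation and agreement steps named in the summary can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes that each human rater returns a set of acceptable options, that the multi-label vector is the fraction of raters who included each option, and that the judge outputs a score in [0, 1] for each option. The function names `multilabel_vector` and `mse`, the option labels, and the example data are all hypothetical.

```python
from typing import Iterable, Sequence


def multilabel_vector(
    response_sets: Sequence[Iterable[str]], options: Sequence[str]
) -> list[float]:
    """Fraction of raters whose response set includes each option.

    Assumption: this per-option inclusion rate is one plausible way to
    aggregate response sets; the source does not spell out the formula.
    """
    if not response_sets:
        raise ValueError("need at least one response set")
    sets = [set(r) for r in response_sets]
    return [sum(opt in s for s in sets) / len(sets) for opt in options]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical data: four raters, each free to select more than one option.
options = ["yes", "no"]
human_sets = [{"yes"}, {"yes", "no"}, {"no"}, {"yes", "no"}]

human_vec = multilabel_vector(human_sets, options)  # [0.75, 0.75]
judge_vec = [0.5, 1.0]  # hypothetical per-option judge scores

score = mse(human_vec, judge_vec)
print(human_vec, score)  # [0.75, 0.75] 0.0625
```

Because raters may select several options, the entries of the human vector need not sum to 1, which is what separates it from a forced-choice label distribution. Lower MSE means closer human-judge agreement.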

14m read time
From blog.ml.cmu.edu
Table of contents
A Framework for Meta-Evaluation under Rating Indeterminacy
Empirical Validation
Practical Takeaways
