A framework for validating LLM-as-a-judge systems when rating tasks admit multiple correct answers (rating indeterminacy). Instead of forcing raters into a single choice, the approach elicits response sets, aggregates the resulting disagreement into multi-label vectors, and measures human-judge agreement with continuous metrics such as MSE.
Table of contents
- A Framework for Meta-Evaluation under Rating Indeterminacy
- Empirical Validation
- Practical Takeaways
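
As a rough illustration of the measurement step the summary describes, here is a minimal Python sketch. The toy data, label set, and the helper `to_multilabel_vector` are hypothetical, not taken from the framework itself: each rater's response set (every label they consider acceptable) is aggregated into a per-item multi-label vector, and human-judge agreement is scored with MSE.

```python
import numpy as np

# Toy example: 3 items, 2 candidate labels ("A", "B").
# Each rater returns a *response set* -- every label they find
# acceptable -- rather than a single forced choice.
human_response_sets = [
    [{"A"}, {"A", "B"}, {"A"}],   # item 1: three human raters
    [{"B"}, {"B"}, {"A", "B"}],   # item 2
    [{"A"}, {"B"}, {"A", "B"}],   # item 3
]
judge_response_sets = [
    [{"A"}],                      # item 1: one LLM-judge response set
    [{"B"}],                      # item 2
    [{"A", "B"}],                 # item 3
]
labels = ["A", "B"]

def to_multilabel_vector(response_sets, labels):
    """Aggregate response sets into a multi-label vector:
    the fraction of raters whose set includes each label."""
    n = len(response_sets)
    return np.array([sum(lbl in s for s in response_sets) / n for lbl in labels])

human_vecs = np.stack([to_multilabel_vector(rs, labels) for rs in human_response_sets])
judge_vecs = np.stack([to_multilabel_vector(rs, labels) for rs in judge_response_sets])

# Continuous human-judge agreement: mean squared error between the
# vectors, averaged over items (lower = closer to the human distribution).
mse = float(np.mean((human_vecs - judge_vecs) ** 2))
print(f"human-judge MSE: {mse:.3f}")
```

The point of the continuous metric is that the judge is rewarded for matching the *distribution* of human response sets (e.g. a 1/3 vote for "B" on item 1), not just the majority label a forced-choice protocol would have recorded.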