DiffuJudge-AV is a framework that treats LLM/VLM judge scores as noisy sensor readings and applies a diffusion-inspired denoising step to produce calibrated, uncertainty-aware evaluations for autonomous driving video QA. The key insight is that Pearson correlation can mask critical failures: a text-only Claude judge achieved r=0.753 but Cohen's κ=0.057, meaning it compressed nearly all scores to the middle of the scale and caught only 2% of safety-critical failures. The framework runs each judge through 7 known bias perturbations (position swap, rubric paraphrase, score-ID format, temperature, etc.), denoises the resulting scores with Tweedie's formula, produces calibrated uncertainty intervals, and routes ambiguous cases to human review. The best-performing judge was Qwen2.5-VL-7B (open, 7B parameters), which reached κ=0.837 and fail-detection F1=0.712 and outperformed the closed models. Adding visual frames to Claude expanded its scoring range from [1.3, 3.5] to [1.0, 5.0], enabling it to actually flag failures. The framework also produces per-model bias heatmaps showing which perturbation sources each judge is most sensitive to.
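
The gap between a high Pearson r and a near-zero Cohen's κ can be reproduced on a toy example. The sketch below is illustrative only: the scores are synthetic, not the article's data, and the compression factor, the rounding to integer labels, the unweighted κ, and the fail threshold of 2 are all assumptions chosen to make the effect visible.

```python
from collections import Counter
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa between two equal-length label sequences."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


# Synthetic human reference scores on a 1-5 scale (assumption, not real data).
human = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]

# A hypothetical judge that ranks items perfectly but compresses every score
# toward the middle of the scale (here into the range 2.6 to 3.4).
judge_raw = [3 + 0.2 * (h - 3) for h in human]
judge_label = [round(s) for s in judge_raw]  # every label collapses to 3

r = pearson(human, judge_raw)
kappa = cohen_kappa(human, judge_label)

# Treat a score of 2 or lower as a failure (assumed threshold).
FAIL_THRESHOLD = 2
true_fails = [h <= FAIL_THRESHOLD for h in human]
flagged = [j <= FAIL_THRESHOLD for j in judge_label]
recall = sum(t and f for t, f in zip(true_fails, flagged)) / sum(true_fails)

print(f"Pearson r = {r:.3f}, Cohen's kappa = {kappa:.3f}, fail recall = {recall:.2f}")
```

Correlation only rewards getting the ordering right, so the compressed judge looks perfect on r while agreeing with the human labels no better than chance and flagging none of the failures.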

19m read time. From towardsdatascience.com
Table of contents
Why “evaluation of evaluation”?
The intuition: a judge score is a noisy sensor reading
The denoising step: Tweedie in one equation
The problem domain and the data
Pipeline
Where this slots into NVIDIA’s AV-Eval stack
Result: Pearson correlation hid the failure mode
Result: vision changed Claude’s scoring behavior
Result: vision unlocks safety-threshold decisions
Result: a single heatmap of judge bias per noise source
Result: does the uncertainty have signal?
Result: stochastic stability hit the original target
Result: conformal coverage matches the calibration target
Limitations
Conclusion
Future work
References
