DiffuJudge-AV: A Diffusion-Inspired Framework for Calibrated AV Video Evaluation

DiffuJudge-AV is a framework that treats LLM/VLM judge scores as noisy sensor readings and applies a diffusion-inspired denoising step to produce calibrated, uncertainty-aware evaluations for autonomous driving video QA. The key insight is that Pearson correlation can mask critical failures: a text-only Claude judge achieved r=0.753 but Cohen's κ=0.057, because it compressed nearly all scores toward the middle of the scale, and it caught only 2% of safety-critical failures. The framework runs each judge through seven known bias perturbations (position swap, rubric paraphrase, score-ID format, temperature, and others), denoises the resulting scores with Tweedie's formula, produces calibrated uncertainty intervals, and routes ambiguous cases to human review. The best-performing judge was Qwen2.5-VL-7B, an open 7B-parameter model, which outperformed closed models with κ=0.837 and fail-detection F1=0.712. Another key finding: adding visual frames to Claude expanded its scoring range from [1.3, 3.5] to [1.0, 5.0], which let it actually flag failures. The framework also produces per-model bias heatmaps showing which perturbation sources each judge is most sensitive to.
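To make the Pearson-versus-κ point concrete, here is a minimal sketch on made-up data, not the article's code or results. It assumes a hypothetical judge that squeezes a 1-5 human score toward 3, an unweighted Cohen's κ computed on scores rounded to integers, and a fail threshold of "score ≤ 2"; all three are illustrative assumptions.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def cohen_kappa(a, b):
    """Unweighted Cohen's kappa between two lists of categorical labels."""
    n = len(a)
    labels = set(a) | set(b)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    p_chance = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_observed - p_chance) / (1 - p_chance)

def fail_recall(human, judge, threshold=2):
    """Share of human-labelled failures (score <= threshold) the judge also flags."""
    fails = [i for i, h in enumerate(human) if h <= threshold]
    caught = [i for i in fails if judge[i] <= threshold]
    return len(caught) / len(fails)

# Hypothetical human reference scores on a 1-5 scale.
human = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]

# Hypothetical compressed judge: same ordering, but squeezed into [2.6, 3.4].
judge_raw = [3 + 0.2 * (h - 3) for h in human]
judge_binned = [round(s) for s in judge_raw]  # every score rounds to 3

r = pearson(human, judge_raw)
kappa = cohen_kappa(human, judge_binned)
recall = fail_recall(human, judge_binned)

print(f"Pearson r   = {r:.3f}")
print(f"Cohen kappa = {kappa:.3f}")
print(f"Fail recall = {recall:.3f}")
```

In this toy case the judge ranks every clip correctly, so Pearson r is 1.0, yet κ is 0 and none of the four failures are flagged: correlation rewards ordering, while agreement and fail detection depend on the scores landing on the right side of a threshold.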

#vlm

Yesterday • 19m read time • From towardsdatascience.com

Table of contents

Why “evaluation of evaluation”?
The intuition: a judge score is a noisy sensor reading
The denoising step: Tweedie in one equation
The problem domain and the data
Pipeline
Where this slots into NVIDIA’s AV-Eval stack
Result: Pearson correlation hid the failure mode
Result: vision changed Claude’s scoring behavior
Result: vision unlocks safety-threshold decisions
Result: a single heatmap of judge bias per noise source
Result: does the uncertainty have signal?
Result: stochastic stability hit the original target
Result: conformal coverage matches the calibration target
Limitations
Conclusion
Future work
References
