CHAI (Critique-based Human-AI Oversight) is a video captioning pipeline developed at CMU with 100+ professional filmmakers and presented as a CVPR 2026 Highlight. Its core insight is that VLMs fail at cinematic prompts not because they lack capacity but because their training captions lack precise filmmaking vocabulary. The pipeline runs a model-draft, human-critique, model-revise loop that exploits asymmetric strengths: LLMs write fluent prose, while trained humans catch grounded visual errors. A structured spec covering ~200 cinematic primitives (shot size, camera movement, focus type, etc.) was built with cinematographers and directors of photography (DPs). Ablations show that critique quality (accuracy, completeness, constructiveness) critically affects downstream performance, and an 8B Qwen3-VL post-trained on CHAI data matches GPT-5 and Gemini-3.1-Pro on captioning benchmarks. Re-captioning a video corpus with the improved model and fine-tuning Wan2.2 on those captions produced a generator that correctly executes techniques such as dolly zooms and rack focus pulls. The key lesson: scaling supervision quality beats scaling model size.
Table of contents
How close is today's video generator to a Hollywood cinematographer?
Question 1: Why do VLMs struggle with cinematic prompts?
Question 2: How should humans and models divide the captioning work?
Question 3: Does the quality of human critique change what the model can learn?
Question 4: Do better captions in the training data give us a better video generator?
Discussion
Resources
References