Netflix's engineering team describes their LLM-as-a-Judge system for automatically evaluating show-synopsis quality at scale. The system scores synopses across four quality dimensions (tone, clarity, precision, factuality) using a combination of techniques: dedicated per-criterion judges, tiered rationales (extended reasoning paired with a concise summary), consensus scoring (5× sampling with aggregation), and an Agents-as-a-Judge approach for factuality checking. The system achieves 85%+ agreement with expert creative writers. Validation against member behavior shows that higher LLM quality scores correlate with higher take fraction and lower abandonment rates, confirming the system captures signals meaningful to real users. Key lessons: a single prompt evaluating multiple criteria hurts accuracy; longer rationales improve scoring but hurt readability (solved by tiered rationales); and reasoning models offer only marginal gains at high cost.
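To make the moving parts concrete, here is a minimal sketch of how dedicated per-criterion judges and consensus scoring could fit together. This is an illustration, not Netflix's implementation: `call_llm`, the judge prompts, the 1-5 scale, and the median aggregator are all assumptions, since the post specifies only "5× sampling with aggregation" and leaves the details unstated.

```python
import statistics

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in: send `prompt` to an LLM and return its text reply."""
    raise NotImplementedError("plug in your model client here")

# One dedicated judge prompt per criterion, reflecting the lesson that a
# single prompt covering all dimensions hurts accuracy. (Factuality is
# handled separately by the Agents-as-a-Judge pipeline and omitted here.)
JUDGE_PROMPTS = {
    "tone":      "Rate the tone of this synopsis from 1-5. Reply with one integer.",
    "clarity":   "Rate the clarity of this synopsis from 1-5. Reply with one integer.",
    "precision": "Rate the precision of this synopsis from 1-5. Reply with one integer.",
}

def judge_criterion(synopsis: str, criterion: str, n_samples: int = 5) -> float:
    """Consensus scoring: sample the judge n_samples times, then aggregate."""
    prompt = f"{JUDGE_PROMPTS[criterion]}\n\nSynopsis:\n{synopsis}"
    samples = [int(call_llm(prompt).strip()) for _ in range(n_samples)]
    # The post does not name the aggregator; median is shown here as one
    # robust choice against occasional outlier samples.
    return statistics.median(samples)

def score_synopsis(synopsis: str) -> dict[str, float]:
    """Run every dedicated judge and return one consensus score per criterion."""
    return {c: judge_criterion(synopsis, c) for c in JUDGE_PROMPTS}
```

Splitting each dimension into its own judge call costs more tokens per synopsis, but it matches the reported lesson that combined prompts degrade accuracy; tiered rationales would extend each prompt to request extended reasoning plus a one-line summary alongside the score.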

From netflixtechblog.com · 10 min read
Table of contents
- Introduction
- The Making of a "Good" Synopsis
- Scaling Quality Scoring with LLM-as-a-Judge
- Member Validation of LLM-as-a-Judge
- Closing Remarks
