Netflix's engineering team describes their LLM-as-a-Judge system for automatically evaluating show-synopsis quality at scale. The system scores synopses across four quality dimensions (tone, clarity, precision, factuality) using a combination of techniques: dedicated per-criterion judges, tiered rationales (extended reasoning paired with a concise summary), consensus scoring (5× sampling with aggregation), and an Agents-as-a-Judge approach for factuality checking. The system achieves 85%+ agreement with expert creative writers. Validation against member behavior shows that higher LLM quality scores correlate with higher take fraction and lower abandonment rates, confirming the system captures signals meaningful to real users. Key lessons: a single prompt evaluating multiple criteria hurts accuracy; longer rationales improve scoring but hurt readability (solved by tiered rationales); and reasoning models offer only marginal gains at high cost.
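To make the moving parts concrete, here is a minimal sketch of how dedicated per-criterion judges and consensus scoring could fit together. This is an illustration, not Netflix's implementation: `call_llm`, the judge prompts, the 1-5 scale, and the median aggregator are all assumptions, since the post specifies only "5× sampling with aggregation" and leaves the details unstated.

```python
import statistics

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in: send `prompt` to an LLM and return its text reply."""
    raise NotImplementedError("plug in your model client here")

# One dedicated judge prompt per criterion, reflecting the lesson that a
# single prompt covering all dimensions hurts accuracy. (Factuality is
# handled separately by the Agents-as-a-Judge pipeline and omitted here.)
JUDGE_PROMPTS = {
    "tone":      "Rate the tone of this synopsis from 1-5. Reply with one integer.",
    "clarity":   "Rate the clarity of this synopsis from 1-5. Reply with one integer.",
    "precision": "Rate the precision of this synopsis from 1-5. Reply with one integer.",
}

def judge_criterion(synopsis: str, criterion: str, n_samples: int = 5) -> float:
    """Consensus scoring: sample the judge n_samples times, then aggregate."""
    prompt = f"{JUDGE_PROMPTS[criterion]}\n\nSynopsis:\n{synopsis}"
    samples = [int(call_llm(prompt).strip()) for _ in range(n_samples)]
    # The post does not name the aggregator; median is shown here as one
    # robust choice against occasional outlier samples.
    return statistics.median(samples)

def score_synopsis(synopsis: str) -> dict[str, float]:
    """Run every dedicated judge and return one consensus score per criterion."""
    return {c: judge_criterion(synopsis, c) for c in JUDGE_PROMPTS}
```

Splitting each dimension into its own judge call costs more tokens per synopsis, but it matches the reported lesson that combined prompts degrade accuracy; tiered rationales would extend each prompt to request extended reasoning plus a one-line summary alongside the score.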

From netflixtechblog.com · 10 min read
Table of contents
- Introduction
- The Making of a "Good" Synopsis
- Scaling Quality Scoring with LLM-as-a-Judge
- Member Validation of LLM-as-a-Judge
- Closing Remarks
