A thorough introduction to the Bradley-Terry model for learning probabilistic rankings from pairwise comparisons. Covers the core mathematical formulation (latent strength parameters, log-likelihood, gradient interpretation), fitting methods (gradient ascent, Newton methods, MM updates), and identifiability constraints. Extends into contextual Bradley-Terry (equivalent to logistic regression on feature differences), with application to LMSYS Chatbot Arena for LLM evaluation. Also covers CrowdBT for handling noisy annotators via EM-based joint estimation of item strengths and annotator reliabilities, plus Bayesian extensions like TrueSkill.
Table of contents
A Simple Example
Fitting the Model From Data
A Deeper Look at Bradley-Terry Model Fitting
From Local Judgments to Global Structure
Why Pairwise Comparisons Are Often Better Than Direct Scores
Going Deeper: Identifiability, Curvature, and Optimization
Contextual Bradley-Terry: When Strength Depends on Setting
Accounting for Noisy Raters: When Not All Comparisons Are Equal
Summary
Further Reading