A research paper from Princeton introduces a framework for measuring AI agent reliability, decomposing it into 12 metrics across four dimensions: consistency, robustness, predictability, and safety. Testing 14 models from OpenAI, Google, and Anthropic across 500 benchmark runs, the study finds that while accuracy has improved substantially over 18 months, reliability gains have been modest, a limitation that appears to be industry-wide. Key findings: agents frequently fail on repeated identical tasks (consistency scores of 30–75%), they are poor at knowing when they're wrong (predictability is the weakest of the four dimensions), and larger models aren't uniformly more reliable. The authors argue that this capability-reliability gap helps explain why AI agents haven't yet produced the expected economic impact, and they call for reliability profiles to be reported alongside accuracy metrics.
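The paper's exact metric definitions aren't reproduced in this summary. As a rough illustration of what a repeated-run consistency measure could look like, the sketch below scores a model by the fraction of tasks on which all repeated identical runs agree (all succeed or all fail); the function name and the all-runs-agree definition are assumptions for illustration, not the authors' formulas.

```python
def consistency_score(run_outcomes: list[list[bool]]) -> float:
    """Fraction of tasks whose repeated identical runs all agree.

    run_outcomes: one inner list per task; each bool is the success or
    failure of one independent run of that task. A task counts as
    consistent only if every repeat succeeds or every repeat fails.
    (Hypothetical metric for illustration; not the paper's definition.)
    """
    consistent = sum(
        1 for outcomes in run_outcomes
        if all(outcomes) or not any(outcomes)
    )
    return consistent / len(run_outcomes)


# Example: 3 tasks, 5 repeated runs each.
runs = [
    [True, True, True, True, True],        # consistent success
    [True, False, True, False, True],      # flaky: the reliability gap
    [False, False, False, False, False],   # consistent failure
]
print(f"consistency: {consistency_score(runs):.0%}")  # -> consistency: 67%
```

Under a definition like this, a model can score well on accuracy (many runs succeed) while scoring poorly on consistency (the same task flips between success and failure across runs), which is the gap the paper highlights.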
Table of contents
- Accuracy isn't enough: four dimensions of reliability
- Capability gains are rapid, but improvements in reliability are modest
- Why we could be wrong
- What should deployers do differently?
- What should researchers and developers do differently?
- What do our findings mean for AI progress?
- Further reading