A research paper from Princeton introduces a framework for measuring AI agent reliability, decomposing it into 12 metrics across four dimensions: consistency, robustness, predictability, and safety. Testing 14 models from OpenAI, Google, and Anthropic across 500 benchmark runs, the study finds that while accuracy has improved substantially over 18 months, reliability gains have been modest, a limitation that appears to be industry-wide. Key findings: agents frequently fail on repeated identical tasks (consistency scores of 30–75%), they are poor at knowing when they're wrong (predictability is the weakest of the four dimensions), and larger models aren't uniformly more reliable. The authors argue that this capability-reliability gap helps explain why AI agents haven't yet produced the expected economic impact, and they call for reliability profiles to be reported alongside accuracy metrics.
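The paper's exact metric definitions aren't reproduced in this summary. As a rough illustration of what a repeated-run consistency measure could look like, the sketch below scores a model by the fraction of tasks on which all repeated identical runs agree (all succeed or all fail); the function name and the all-runs-agree definition are assumptions for illustration, not the authors' formulas.

```python
def consistency_score(run_outcomes: list[list[bool]]) -> float:
    """Fraction of tasks whose repeated identical runs all agree.

    run_outcomes: one inner list per task; each bool is the success or
    failure of one independent run of that task. A task counts as
    consistent only if every repeat succeeds or every repeat fails.
    (Hypothetical metric for illustration; not the paper's definition.)
    """
    consistent = sum(
        1 for outcomes in run_outcomes
        if all(outcomes) or not any(outcomes)
    )
    return consistent / len(run_outcomes)


# Example: 3 tasks, 5 repeated runs each.
runs = [
    [True, True, True, True, True],        # consistent success
    [True, False, True, False, True],      # flaky: the reliability gap
    [False, False, False, False, False],   # consistent failure
]
print(f"consistency: {consistency_score(runs):.0%}")  # -> consistency: 67%
```

Under a definition like this, a model can score well on accuracy (many runs succeed) while scoring poorly on consistency (the same task flips between success and failure across runs), which is the gap the paper highlights.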
Table of contents
- Accuracy isn't enough: four dimensions of reliability
- Capability gains are rapid, but improvements in reliability are modest
- Why we could be wrong
- What should deployers do differently?
- What should researchers and developers do differently?
- What do our findings mean for AI progress?
- Further reading