
josh :) @joshycodes
People are quickly misinterpreting this graph as hype. Here's a clear explanation of what's happening:

METR builds a benchmark of software tasks (debugging complex systems, training ML models, finding security vulnerabilities). They measure how long each task takes a skilled human expert to complete, then test AI agents on those same tasks.

The "time horizon" is a summary statistic: the task length at which a given AI succeeds 50% of the time. A model with a 2-hour time horizon completes half of the tasks that would take a human expert 2 hours.

METR just reported that Claude Opus 4.6 has a 50% time horizon of ~14.5 hours, which would of course be incredibly impressive... but METR is telling us to be cautious! There is a statistical problem: there simply aren't enough hard tasks left to anchor the upper end of the curve, and frontier models are now succeeding at nearly everything in the task suite. So small random variations in the results swing the estimate dramatically. The 95% confidence interval spans from 6 hours to 98 hours, which is far too wide a range for anyone to draw conclusions from.

METR themselves are working on new methods to measure at this level, so taper expectations just a bit :)
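To make the statistics concrete, here is a minimal sketch (not METR's actual code, and the task data below is made up) of how a 50% time horizon can be estimated and why the confidence interval blows up when a model succeeds at almost every task: fit a logistic curve of success probability against log task length, read off where it crosses 50%, and bootstrap over tasks to get an interval.

```python
# Sketch only: illustrative data and a hand-rolled logistic fit, not METR's pipeline.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical benchmark: human-expert task lengths (hours) and whether the
# model solved each task (1 = success, 0 = failure). Note the model succeeds
# at nearly everything, with only a few failures at the long end.
task_hours = np.array([0.1, 0.25, 0.5, 1, 2, 4, 8, 16, 0.2, 0.5, 1, 2, 4, 8, 16, 32])
success    = np.array([1,   1,    1,   1, 1, 1, 1, 0,  1,   1,   1, 1, 0, 1, 1,  0])

def fit_logistic(hours, y, iters=5000, lr=0.1):
    """Fit P(success) = sigmoid(a + b * log2(hours)) by plain gradient descent."""
    x = np.log2(hours)
    a, b = 0.0, 0.0
    for _ in range(iters):
        z = np.clip(a + b * x, -30, 30)      # avoid overflow in exp
        p = 1.0 / (1.0 + np.exp(-z))
        a -= lr * np.mean(p - y)
        b -= lr * np.mean((p - y) * x)
    return a, b

def horizon_50(a, b):
    """Task length (hours) at which the fitted curve crosses 50% success."""
    return 2.0 ** (-a / b)

a, b = fit_logistic(task_hours, success)
print(f"point estimate of 50% horizon: {horizon_50(a, b):.1f} h")

# Bootstrap: resample tasks with replacement and refit. Because the few long
# failures carry almost all the information about the upper end of the curve,
# resamples that drop or duplicate them give wildly different horizons --
# which is exactly why the reported interval is so wide.
horizons = []
for _ in range(200):
    idx = rng.integers(0, len(task_hours), len(task_hours))
    a_b, b_b = fit_logistic(task_hours[idx], success[idx])
    if b_b < 0:  # keep only fits where success actually declines with length
        horizons.append(horizon_50(a_b, b_b))

lo, hi = np.percentile(horizons, [2.5, 97.5])
print(f"bootstrap 95% interval: {lo:.1f} h to {hi:.1f} h")
```

On this toy data the point estimate looks respectable while the bootstrap interval spans a huge range, mirroring the 6-to-98-hour spread in the post: the estimate isn't wrong, it's just anchored by too few hard tasks.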