Pass@k is Mostly Bunk


The pass@k metric, commonly used to evaluate AI agents, is fundamentally flawed because it is exponentially forgiving. It measures the probability that at least one of k attempts succeeds, which produces misleadingly high success rates even for poorly performing models: a model with only a 5% per-attempt success rate can show 99.4% pass@100. This doesn't reflect real-world usage, where humans expect consistent success across multiple steps, not one success out of many attempts. The metric should be used only in rare cases that combine simple tasks, reliable evaluators, and no human interaction, and its use requires careful justification each time.
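The arithmetic behind the 99.4% figure can be sketched in a few lines of Python. This is a minimal illustration, not the original post's code: it assumes every attempt is independent and succeeds with the same probability `p`, and the function names `pass_at_k` and `all_steps_succeed` are hypothetical helpers introduced here.

```python
def pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k attempts succeeds.

    Assumption: attempts are independent, each succeeding with probability p.
    """
    return 1.0 - (1.0 - p) ** k


def all_steps_succeed(p: float, n: int) -> float:
    """Chance that all n steps succeed, under the same independence assumption."""
    return p ** n


if __name__ == "__main__":
    # A model that succeeds only 5% of the time per attempt.
    for k in (1, 10, 100):
        print(f"pass@{k} = {pass_at_k(0.05, k):.3f}")

    # The opposite view: a 95%-reliable model asked to get 10 steps right in a row.
    print(f"10 steps at 95% each = {all_steps_succeed(0.95, 10):.3f}")
```

Under these assumptions the failure probability shrinks as (1 - p)^k, so pass@k climbs toward 100% as k grows, while the chance of getting every step of a multi-step task right falls as p^n.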

2m read time. From brooker.co.za