Computer Use Is 45x More Expensive Than Structured APIs
This benchmark compares two ways for an AI agent to operate an admin panel: a vision agent (browser-use with Claude Sonnet) and an API agent calling structured HTTP endpoints. The vision agent needed 53 steps and ~551k tokens, averaging 17 minutes per run. The API agent completed the same task in 8 calls and ~12k tokens in under 20 seconds, making it roughly 45x cheaper. The vision agent also needed a detailed 14-step UI walkthrough in its prompt to complete the task correctly, whereas the API agent succeeded on a plain task description. Key finding: the cost gap is architectural, not a model-quality issue. Better vision models reduce per-step errors but not the step count, which is determined by the interface. The benchmark was made possible by Reflex 0.9, which auto-generates HTTP endpoints from event handlers and so removes the need to write a separate API layer for internal tools.
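The headline multiple can be checked from the figures quoted above. The sketch below is only a back-of-the-envelope calculation: the dictionary names and the `ratio` helper are illustrative, the inputs are the rounded per-run numbers from this summary, and treating the token ratio as a cost ratio assumes both agents pay the same per-token price with a similar input/output mix, which the summary does not state.

```python
# Approximate per-run figures quoted in the summary (rounded in the source).
VISION = {"steps": 53, "tokens": 551_000, "seconds": 17 * 60}
# "Under 20 seconds" is an upper bound, so the time ratio below is a lower bound.
API = {"steps": 8, "tokens": 12_000, "seconds": 20}


def ratio(metric: str) -> float:
    """Return how many times larger the vision agent's figure is than the API agent's."""
    return VISION[metric] / API[metric]


if __name__ == "__main__":
    for metric in ("steps", "tokens", "seconds"):
        print(f"{metric}: {ratio(metric):.1f}x")
```

The token ratio comes out just under 46, consistent with the "roughly 45x" headline. The step ratio is far smaller (under 7), which suggests that each vision step also consumes many more tokens than each API call.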
Table of contents
Why vision agents?
The setup
The vision agent couldn't complete the task
With a 14-step walkthrough, it succeeded
How we ran it
The full results
The structural gap
How we justify the API engineering cost
Notes
Reproduce it