State-of-the-art LLMs can solve many scoped coding tasks, but can they execute end-to-end software projects? To find out, we built the Stripe integration benchmark: an agentic test of real API integration work in a production-realistic environment.


Stripe

Stripe built a benchmark to evaluate whether AI agents can autonomously complete real-world Stripe API integrations end to end. The benchmark includes 11 diverse environments spanning backend-only tasks, full-stack tasks, and gym problem sets, and runs them through a goose-based agent harness with MCP tools for terminal, browser, and Stripe documentation access. Claude Opus 4.5 achieved 92% on full-stack tasks, and GPT-5.2 scored 73% on gym problem sets. Agents surprised researchers by navigating UIs, debugging live issues, and handling underdocumented behavior. Key failure modes included mishandling ambiguous situations (e.g., treating 400 errors as success) and getting stuck during browser interactions. The benchmark is open-sourced to help the community improve agentic tooling for API integrations.
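To make the "treating 400 errors as success" failure mode concrete, here is a minimal sketch of the check an integration could apply before declaring a step done. It is illustrative only and not taken from the benchmark: `Response`, `is_success`, and `summarize` are hypothetical names, and the nested `error.message` body shape is assumed for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Response:
    """Hypothetical stand-in for an HTTP response (not a Stripe SDK type)."""
    status_code: int
    body: dict = field(default_factory=dict)


def is_success(resp: Response) -> bool:
    # Only 2xx codes mean the request was accepted; a 400 is a client error.
    return 200 <= resp.status_code < 300


def summarize(resp: Response) -> str:
    if is_success(resp):
        return "ok"
    # Assumed error shape: details nested under an "error" key.
    error = resp.body.get("error", {})
    message = error.get("message", "unknown error")
    return f"failed ({resp.status_code}): {message}"


if __name__ == "__main__":
    bad = Response(400, {"error": {"message": "Missing required param"}})
    print(summarize(Response(200)))
    print(summarize(bad))
```

An agent that only checks whether a response arrived, and not its status code, would report the second call as a completed step.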

Can AI agents build real Stripe integrations? We built a benchmark to find out

How we constructed the Stripe integration benchmark

Looking ahead: The promise of benchmarking