Thought Pokémon was a tough benchmark for AI? One group of researchers argues that Super Mario Bros. is even tougher.


TechCrunch

Researchers from Hao AI Lab at the University of California San Diego tested AI models using Super Mario Bros. as a benchmark. Anthropic’s Claude 3.7 performed best, while OpenAI’s GPT-4o struggled. Reasoning models did worse than non-reasoning ones, because they take longer to decide on each action and timing is critical in a real-time game. A modified version of the game ran in an emulator integrated with the GamingAgent framework, which supplied instructions to the AI. The study adds to the ongoing debate over how useful games are as AI benchmarks.