Crowdsourced AI benchmarks like Chatbot Arena, which have become popular among AI labs, have serious flaws, some experts say.

TechCrunch

AI labs increasingly rely on crowdsourced benchmarking platforms such as Chatbot Arena to evaluate their models, but some experts argue that these benchmarks have significant flaws. They criticize the benchmarks' lack of construct validity and allege that labs may exploit them to make exaggerated claims. The experts call for more dynamic, diverse, and professional benchmarks, and stress that the people who evaluate models should be compensated. They also highlight the importance of clear communication and of using multiple evaluation metrics in AI benchmarking.

Crowdsourced AI benchmarks have serious flaws, some experts say