New research examines how leading AI models perform on real white-collar work tasks drawn from consulting, investment banking, and law. Most models failed.

TechCrunch

New research from Mercor introduces Apex-Agents, a benchmark testing AI models on real white-collar tasks from consulting, investment banking, and law. Leading models achieved only 24% accuracy at best, with Gemini 3 Flash and GPT-5.2 performing strongest. The main challenge is multi-domain reasoning across tools like Slack and Google Drive, which professionals use daily. Tasks require synthesizing information from multiple sources and applying domain-specific knowledge, like EU privacy law. While current results show AI agents aren't ready to replace knowledge workers, rapid year-over-year improvements suggest this gap may narrow quickly.

Are AI agents ready for the workplace? A new benchmark raises doubts.