The Register

The ORCA Benchmark, a set of 500 practical math questions, was used to evaluate a new round of leading LLMs, including ChatGPT 5.2, Gemini 3 Flash, Grok 4.1, and DeepSeek V3.2. Gemini 3 Flash led with 72.8% accuracy, while the others scored between 54% and 60%. All models improved except Grok 4.1, which regressed. A key finding is that calculation errors now account for 39.8% of all mistakes, up from 33.4%, even as models have become better at making math look correct through formatting. Researchers attribute the persistent failures to LLMs being prediction engines rather than logic engines: they pattern-match rather than truly calculate. Potential mitigations include function calling, which offloads arithmetic to deterministic systems, and formal proof verification using tools like Lean.
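
To make the function-calling mitigation concrete, here is a minimal Python sketch of the deterministic side of such a setup. It is not taken from the benchmark or from any vendor's API. The tool name `calculate` and the idea that the model hands over an arithmetic expression as a string are assumptions for illustration. The point is only that the arithmetic is done by ordinary code, so the model does not have to predict the digits.

```python
import ast
import operator
from fractions import Fraction

# Hypothetical tool: the model would emit an expression string and the host
# application would evaluate it, instead of trusting the model's arithmetic.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def _eval_node(node: ast.AST) -> Fraction:
    """Evaluate one parsed node, allowing only numbers and + - * /."""
    if isinstance(node, ast.Constant):
        value = node.value
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError("only numeric literals are allowed")
        # Going through str() keeps short decimals such as 0.1 exact (1/10).
        return Fraction(str(value))
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        left = _eval_node(node.left)
        right = _eval_node(node.right)
        return _BINARY_OPS[type(node.op)](left, right)
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError(f"unsupported expression element: {type(node).__name__}")


def calculate(expression: str) -> Fraction:
    """Deterministically evaluate a plain arithmetic expression."""
    tree = ast.parse(expression, mode="eval")
    return _eval_node(tree.body)


if __name__ == "__main__":
    # Example: a monthly share of 1250 plus 8%, spread over 12 months.
    print(float(calculate("1250 * 1.08 / 12")))  # 112.5
```

The evaluator walks the parsed expression instead of calling `eval`, so anything other than numbers and the four basic operators is rejected. Exact rational arithmetic is one possible design choice here; a production tool would likely also need limits on input size and support for more operations.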

AI models get better at math but still get low marks