QIMMA (Arabic for 'summit') is a new Arabic LLM leaderboard that runs a rigorous quality validation pipeline on its benchmarks before evaluating any models. It consolidates 109 subsets from 14 source benchmarks into 52,000+ samples across 7 domains: cultural, STEM, legal, medical, safety, poetry, and coding. In the pipeline, two LLMs (Qwen3-235B and DeepSeek-V3) score each sample on a 10-point rubric, and borderline cases go to human review. Key findings: widely used Arabic benchmarks have systematic quality issues (ArabicMMLU had a 3.1% discard rate), and 81-88% of Arabic coding benchmark prompts required refinement. Among 46 evaluated models, Jais-2-70B-Chat leads overall (65.81), narrowly ahead of Qwen2.5-72B-Instruct (65.75). Arabic-specialized models struggle with coding tasks, while multilingual models perform competitively. QIMMA is the first Arabic leaderboard to combine open-source code, native Arabic content, quality validation, code evaluation, and public per-sample outputs.
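To make the two-judge validation step concrete, here is a minimal sketch of how two rubric scores could be turned into a keep / discard / human-review decision. Everything specific in it is an assumption for illustration: the `triage` function, the `Verdict` type, the 0-10 integer score range, and the `KEEP_MIN` and `DISCARD_MAX` cutoffs are hypothetical, since the summary does not state the actual thresholds or how the two scores are combined.

```python
from dataclasses import dataclass

# Hypothetical cutoffs: the summary does not give QIMMA's real thresholds.
KEEP_MIN = 7      # both judges at or above this -> keep the sample
DISCARD_MAX = 4   # both judges at or below this -> discard the sample


@dataclass
class Verdict:
    sample_id: str
    decision: str  # "keep", "discard", or "human_review"


def triage(sample_id: str, score_a: int, score_b: int) -> Verdict:
    """Route one benchmark sample based on two judge scores.

    score_a and score_b stand in for the rubric scores from the two
    LLM judges; a 0-10 integer range is assumed here.
    """
    for score in (score_a, score_b):
        if not 0 <= score <= 10:
            raise ValueError(f"score out of range: {score}")

    if score_a >= KEEP_MIN and score_b >= KEEP_MIN:
        decision = "keep"
    elif score_a <= DISCARD_MAX and score_b <= DISCARD_MAX:
        decision = "discard"
    else:
        # Judges disagree or both land in the middle band: borderline case.
        decision = "human_review"
    return Verdict(sample_id, decision)


if __name__ == "__main__":
    for sid, a, b in [("s1", 9, 8), ("s2", 2, 3), ("s3", 9, 3), ("s4", 6, 5)]:
        print(sid, triage(sid, a, b).decision)
```

Requiring agreement between both judges before an automatic keep or discard is one simple way to reserve human effort for the cases where the models disagree or are unsure; other combination rules (averaging, weighted votes) would be equally plausible.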
Table of contents
🔍 The Problem: Arabic NLP Evaluation Is Fragmented and Unvalidated
⛰ What's in QIMMA?
🔬 The Quality Validation Pipeline
⚠️ What We Found: Systematic Quality Problems
💻 Code Benchmark: A Different Kind of Quality Work
⚙️ Evaluation Setup
🏆 Leaderboard Results
🌟 What Makes QIMMA Different
🔗 Resources
🔖 Citation