QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard

QIMMA (Arabic for 'summit') is a new Arabic LLM leaderboard that runs its benchmarks through a quality validation pipeline before evaluating any models. It consolidates 109 subsets from 14 source benchmarks into 52,000+ samples across 7 domains: cultural, STEM, legal, medical, safety, poetry, and coding. In the pipeline, two LLMs (Qwen3-235B and DeepSeek-V3) score samples on a 10-point rubric, and borderline cases go to human review. Two findings stand out: widely used Arabic benchmarks have systematic quality issues (ArabicMMLU had a 3.1% discard rate), and 81-88% of Arabic coding benchmark prompts required refinement. Among 46 evaluated models, Jais-2-70B-Chat leads overall (65.81), narrowly ahead of Qwen2.5-72B-Instruct (65.75). Arabic-specialized models struggle with coding tasks, while multilingual models perform competitively. QIMMA is the first Arabic leaderboard combining open-source code, native Arabic content, quality validation, code evaluation, and public per-sample outputs.
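The two-judge triage described above can be sketched roughly as follows. This is only an illustration of the idea, not QIMMA's actual implementation: the thresholds `KEEP_MIN` and `DISCARD_MAX`, the `Judgement` type, and the `triage` and `discard_rate` helpers are all hypothetical, since the summary does not give the real cut-offs or decision rules.

```python
from dataclasses import dataclass

# Hypothetical thresholds on the 10-point rubric; the summary does not
# state the cut-offs the QIMMA pipeline actually uses.
KEEP_MIN = 7     # both judges at or above this: keep the sample
DISCARD_MAX = 4  # both judges at or below this: discard the sample


@dataclass
class Judgement:
    sample_id: str
    score_a: int  # rubric score from the first LLM judge
    score_b: int  # rubric score from the second LLM judge


def triage(j: Judgement) -> str:
    """Return 'keep', 'discard', or 'human_review' for one sample."""
    if j.score_a >= KEEP_MIN and j.score_b >= KEEP_MIN:
        return "keep"
    if j.score_a <= DISCARD_MAX and j.score_b <= DISCARD_MAX:
        return "discard"
    # Mid-range scores or disagreement between judges count as borderline.
    return "human_review"


def discard_rate(judgements: list[Judgement]) -> float:
    """Percentage of samples that both judges agree to discard."""
    if not judgements:
        return 0.0
    discarded = sum(1 for j in judgements if triage(j) == "discard")
    return 100.0 * discarded / len(judgements)
```

Under these assumed rules, a per-benchmark figure such as the 3.1% reported for ArabicMMLU would correspond to `discard_rate` computed over that benchmark's samples, although the real pipeline may also count samples rejected during human review.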

#llm

#nlp

Apr 21 • 9m read time • From huggingface.co

Table of contents

🔍 The Problem: Arabic NLP Evaluation Is Fragmented and Unvalidated
⛰ What's in QIMMA?
🔬 The Quality Validation Pipeline
⚠️ What We Found: Systematic Quality Problems
💻 Code Benchmark: A Different Kind of Quality Work
⚙️ Evaluation Setup
🏆 Leaderboard Results
🌟 What Makes QIMMA Different
🔗 Resources
🔖 Citation
