A blog post by IBM Research on Hugging Face

ITBench-AA is a new benchmark, developed by Artificial Analysis and IBM, for evaluating AI agents on enterprise IT and site reliability engineering (SRE) tasks. Every frontier model scores below 50%: Claude Opus 4.7 leads at 47%, followed by GPT-5.5 at 46%. The benchmark consists of 59 SRE tasks in which agents diagnose Kubernetes incidents by analyzing alerts, logs, traces, and topology data. A key finding is that more investigation turns do not improve accuracy: models with longer trajectories (e.g., Gemini 3.1 Pro Preview at 83 turns) score lower than terser models. Open-weight models offer competitive cost-performance tradeoffs, with Gemma 4 31B scoring 37% at only $0.14 per task, while proprietary models cost up to $5.38 per task.
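
To put the per-task prices in perspective, the arithmetic below scales them to a full pass over the 59 tasks. It is a rough sketch: it assumes the quoted per-task cost is an average over all 59 tasks, which the summary does not state, and the function name `full_run_cost` is made up for illustration. The summary also does not name the model behind the $5.38 figure, so it is treated here only as an upper bound.

```python
# Rough cost arithmetic from the figures quoted in the summary.
# Assumption: each per-task price is an average over all 59 tasks.

N_TASKS = 59  # number of SRE tasks in ITBench-AA, per the summary


def full_run_cost(cost_per_task: float, n_tasks: int = N_TASKS) -> float:
    """Estimated cost in USD of running every task once (hypothetical helper)."""
    return cost_per_task * n_tasks


gemma_cost = full_run_cost(0.14)   # Gemma 4 31B, $0.14 per task
upper_cost = full_run_cost(5.38)   # most expensive proprietary figure quoted
price_ratio = 5.38 / 0.14          # how many times more expensive per task

print(f"Gemma 4 31B full run:        ${gemma_cost:.2f}")   # $8.26
print(f"Upper-bound proprietary run: ${upper_cost:.2f}")   # $317.42
print(f"Per-task price ratio:        {price_ratio:.1f}x")  # 38.4x
```

Under that assumption, one full run costs roughly $8 at the Gemma 4 31B price and roughly $317 at the highest quoted price, a gap of about 38x.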

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM