Hugging Face's platform is a resource for developers and researchers working in natural language processing (NLP) and machine learning. Through articles, tutorials, and open-source projects, it covers state-of-the-art NLP models, tools, and datasets, including transformer architectures and transfer learning methods. Developers can learn about using pre-trained models, fine-tuning strategies, and deploying NLP applications with Hugging Face's libraries and APIs.

Hugging Face

SPEED-Bench is a new unified benchmark from NVIDIA for evaluating speculative decoding (SD) in LLM inference. Existing benchmarks are fragmented, use small prompt sets, and fail to reflect real-world serving conditions. SPEED-Bench addresses this with two dataset splits: a Qualitative split (880 prompts across 11 semantic domains, selected for maximum diversity) and a Throughput split (input sequence length (ISL) buckets from 1k to 32k tokens, run at high concurrency). A unified measurement framework integrates with TensorRT-LLM, vLLM, and SGLang, and feeds every engine the same pre-tokenized inputs so that cross-engine comparisons are fair. Key findings include: SD acceptance length is highly domain-dependent (Coding and Math outperform Roleplay and Writing); co-trained multi-token prediction (MTP) heads outperform post-trained drafters such as EAGLE3; vocabulary pruning degrades acceptance on multilingual and retrieval-augmented generation (RAG) tasks; and random token inputs overestimate SD throughput by ~23%. The dataset and framework are openly available on Hugging Face and GitHub.
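To make the per-domain comparison concrete, here is a minimal sketch of how a mean acceptance length could be aggregated by domain. It assumes acceptance length means the number of tokens emitted per verification step of the target model; the function name `mean_acceptance_length`, the record layout, and the sample numbers are all hypothetical and are not taken from the SPEED-Bench framework or its results.

```python
from collections import defaultdict


def mean_acceptance_length(records):
    """Average tokens emitted per verification step, grouped by domain.

    records: iterable of (domain, tokens_per_step) pairs, where
    tokens_per_step lists the tokens emitted at each verification step
    of one request. This layout is an assumption made for illustration.
    """
    totals = defaultdict(lambda: [0, 0])  # domain -> [total tokens, total steps]
    for domain, tokens_per_step in records:
        totals[domain][0] += sum(tokens_per_step)
        totals[domain][1] += len(tokens_per_step)
    # Skip domains with no recorded steps to avoid dividing by zero.
    return {d: tok / steps for d, (tok, steps) in totals.items() if steps > 0}


# Made-up traces, chosen only to show the aggregation; not benchmark data.
sample = [
    ("coding", [4, 3, 5, 4]),
    ("coding", [3, 5]),
    ("roleplay", [2, 1, 2, 3]),
]
print(mean_acceptance_length(sample))
```

Pooling tokens and steps across requests before dividing weights long generations more heavily than short ones; averaging per-request means instead would be an equally defensible choice.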

Introducing SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

The Qualitative split: semantic coverage and draft accuracy

The Throughput split: realistic serving workloads
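As a rough illustration of ISL bucketing, the sketch below groups pre-tokenized prompts by input length. The summary only states a 1k to 32k token range, so the power-of-two bucket edges, the names `BUCKET_EDGES`, `isl_bucket`, and `group_by_isl`, and the choice to drop prompts above the largest bucket are assumptions, not the benchmark's actual construction.

```python
import bisect

# Hypothetical bucket upper bounds in tokens; only the 1k-32k range is
# given in the summary, so the intermediate edges are an assumption.
BUCKET_EDGES = [1024, 2048, 4096, 8192, 16384, 32768]


def isl_bucket(num_tokens):
    """Return the smallest bucket edge >= num_tokens, or None above 32k."""
    i = bisect.bisect_left(BUCKET_EDGES, num_tokens)
    return BUCKET_EDGES[i] if i < len(BUCKET_EDGES) else None


def group_by_isl(prompts):
    """Group pre-tokenized prompts (lists of token IDs) by ISL bucket."""
    buckets = {edge: [] for edge in BUCKET_EDGES}
    for token_ids in prompts:
        edge = isl_bucket(len(token_ids))
        if edge is not None:  # prompts longer than the largest bucket are dropped
            buckets[edge].append(token_ids)
    return buckets
```

Bucketing on token counts rather than character counts ties each bucket to what the engine actually processes, which is consistent with the benchmark's use of pre-tokenized inputs.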