Is This The Fastest ASR?


IBM's Granite Speech 4.1 release introduces three ~2B parameter ASR models, each targeting a different use case. The base model leads the Open ASR Leaderboard with a 5.33% word error rate and processes an hour of audio in ~16 seconds. The Plus model adds speaker diarization, word-level timestamps, and incremental decoding for meeting and podcast transcription, at the cost of fewer supported languages. The third model uses a novel Non-Autoregressive LLM-based Editing (NLE) technique, which generates a CTC draft transcript and then refines it in parallel. It reaches a real-time factor of 1820x on an H100, so an hour of audio is transcribed in ~2 seconds. The base model supports keyword biasing, and all three models run locally via HuggingFace Transformers and are fine-tunable for domain-specific use cases.
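
The speed figures above can be cross-checked with a little arithmetic. This is a minimal sketch, assuming the quoted "real-time factor" means audio duration divided by processing time (a speed multiple, so higher is faster); the function names are illustrative and not part of any Granite or Transformers API.

```python
def seconds_to_transcribe(audio_seconds: float, speed_multiple: float) -> float:
    """Wall-clock seconds needed for audio processed at a given speed multiple.

    Assumes speed_multiple = audio duration / processing time.
    """
    if speed_multiple <= 0:
        raise ValueError("speed_multiple must be positive")
    return audio_seconds / speed_multiple


def speed_multiple_from_timing(audio_seconds: float, processing_seconds: float) -> float:
    """Speed multiple implied by a measured processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


HOUR = 3600.0

# NLE model: 1820x on an H100, as quoted in the summary.
nle_seconds = seconds_to_transcribe(HOUR, 1820.0)

# Base model: an hour of audio in roughly 16 seconds, as quoted in the summary.
base_multiple = speed_multiple_from_timing(HOUR, 16.0)

print(f"NLE model: one hour of audio in about {nle_seconds:.2f} s")
print(f"Base model: roughly {base_multiple:.0f}x real time")
```

Under that assumption, 1820x works out to just under 2 seconds per hour of audio, which matches the ~2 second figure, and the base model's ~16 seconds corresponds to roughly 225x real time.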

14m watch time
