In this video, I dive into IBM's newly released Granite Speech 4.1 models and look at what makes them interesting: three 2B variants, each making a different trade-off between accuracy, richness of output, and throughput that you'll actually care about for real applications.

🔗 Links:
IBM Research Blog → https://research.ibm.com/blog/granite-4-1-ai-foundation-models

Twitter: https://x.com/Sam_Witteveen 

🕵️ Interested in building LLM Agents? Fill out the form below
Building LLM Agents Form: https://drp.li/dIMes

👨‍💻Github:
https://github.com/samwit/llm-tutorials

⏱️Time Stamps:
00:00 Intro
00:20 IBM Granite Collection
00:27 Granite Docling
00:46 Granite Speech 4.1
01:16 Granite 4.1 Blog
01:38 Granite Speech 4.1 2B
04:02 Granite Speech 4.1 2B Plus
06:15 Granite Speech 4.1 2B NAR
07:30 NLE: Non-autoregressive LLM-based ASR by Transcript Editing Paper
07:45 Architecture
09:45 Code Time
12:00 Granite Speech Model Github

#DellProPrecision #DellProMax #Delltech #localai #NVIDIA

Sam Witteveen AI is a publication offering insights, tutorials, and resources for artificial intelligence (AI) enthusiasts and practitioners. Readers can learn about machine learning algorithms, deep learning frameworks, and AI applications. With tutorials, case studies, and expert interviews, Sam Witteveen AI provides guidance and expertise for building and deploying AI solutions.

Sam Witteveen

IBM's Granite Speech 4.1 release introduces three ~2B-parameter ASR models targeting different use cases. The base model leads the Open ASR Leaderboard with a 5.33% word error rate and processes an hour of audio in ~16 seconds. The Plus model adds speaker diarization, word-level timestamps, and incremental decoding for meeting and podcast transcription, at the cost of fewer supported languages. The third model uses a novel Non-Autoregressive LLM-based Editing (NLE) technique: it generates a CTC draft transcript and then refines it in parallel, reaching a real-time factor of 1820x on an H100, so an hour of audio can be transcribed in ~2 seconds. The base model also supports keyword biasing, and all three models run locally via Hugging Face Transformers and are fine-tunable for domain-specific use cases.
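
As a quick sanity check on the throughput figures above, here is a minimal Python sketch of the arithmetic. It assumes the 1820x figure means seconds of audio processed per second of compute (often written RTFx). The function names are made up for illustration and are not part of any Granite or Hugging Face API.

```python
def seconds_to_transcribe(audio_seconds: float, rtfx: float) -> float:
    """Wall-clock seconds needed to transcribe audio at a given RTFx."""
    return audio_seconds / rtfx


def rtfx_from_timing(audio_seconds: float, wall_seconds: float) -> float:
    """RTFx implied by transcribing audio_seconds in wall_seconds."""
    return audio_seconds / wall_seconds


HOUR = 3600.0  # one hour of audio, in seconds

# NAR model: 1820x quoted on an H100 (figure taken from the summary above).
nar_seconds = seconds_to_transcribe(HOUR, 1820.0)

# Base model: the rounded "~16 seconds per hour" figure implies this RTFx.
base_rtfx = rtfx_from_timing(HOUR, 16.0)

print(f"NAR: one hour of audio in about {nar_seconds:.2f} s")
print(f"Base: roughly {base_rtfx:.0f}x real time")
```

The base-model number is derived from a rounded timing, so treat it as a ballpark rather than a benchmark result.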

Is This The Fastest ASR?