Scaling Real-Time Voice Agents with Cache-Aware Streaming ASR

NVIDIA introduces Nemotron Speech ASR, an open model that uses a cache-aware streaming architecture for real-time voice interactions. Unlike traditional buffered inference systems, which repeatedly reprocess overlapping audio windows, it maintains an internal cache of encoder representations and processes each audio frame exactly once. The model achieves 3x higher efficiency, supports 560 concurrent streams on H100 GPUs, maintains stable latency under load, and delivers a 24ms median time-to-final transcription. Real-world validation from Daily and Modal shows zero latency drift at scale, enabling natural conversational agents with sub-900ms voice-to-voice loops.
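The difference between buffered and cache-aware inference can be illustrated with a toy sketch. This is not Nemotron Speech ASR's API: `encode`, `BufferedASR`, `CacheAwareASR`, and `run` are hypothetical names, and the "encoder" is a stand-in arithmetic function. The sketch only counts how many frames each approach encodes for the same audio stream.

```python
from collections import deque


def encode(frame: float) -> float:
    """Stand-in for an encoder; a real model would run a neural network here."""
    return frame * 2.0


class BufferedASR:
    """Hypothetical buffered inference: re-encodes past context with every chunk."""

    def __init__(self, context: int) -> None:
        self.raw = deque(maxlen=context)  # raw audio frames kept for re-encoding
        self.frames_encoded = 0

    def push(self, chunk: list[float]) -> list[float]:
        window = list(self.raw) + chunk           # overlapping window
        encoded = [encode(f) for f in window]     # old frames are encoded again
        self.frames_encoded += len(window)
        self.raw.extend(chunk)
        return encoded[len(window) - len(chunk):]  # keep only the new outputs


class CacheAwareASR:
    """Hypothetical cache-aware inference: each frame is encoded exactly once."""

    def __init__(self, context: int) -> None:
        self.cache = deque(maxlen=context)  # cached encoder representations
        self.frames_encoded = 0

    def push(self, chunk: list[float]) -> list[float]:
        new = [encode(f) for f in chunk]    # only the new frames are encoded
        self.frames_encoded += len(chunk)
        self.cache.extend(new)              # a real encoder would attend to this cache
        return new


def run(model, num_chunks: int, chunk_size: int) -> list[float]:
    """Feed a synthetic stream chunk by chunk and collect the outputs."""
    out: list[float] = []
    for i in range(num_chunks):
        chunk = [float(i * chunk_size + j) for j in range(chunk_size)]
        out.extend(model.push(chunk))
    return out


buffered, cached = BufferedASR(context=16), CacheAwareASR(context=16)
same_output = run(buffered, 10, 8) == run(cached, 10, 8)
print(same_output, buffered.frames_encoded, cached.frames_encoded)
```

In this toy the encoder is pointwise, so both approaches produce identical outputs and differ only in work done. In a real streaming encoder the cached representations supply the left context for new frames, which is what makes skipping the re-encoding possible.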

#machine-learning #nvidia #real-time-systems #voice-ai

Jan 05 • 9m read time • From huggingface.co

Table of contents

The Challenge: Why Streaming ASR Breaks at Scale
The Solution: Cache-Aware Streaming ASR for Lower Latency, Linear Scale, and Predictable Cost
Results: Throughput, Accuracy, and Speed at Scale
Real-World Validation
Conclusion: A New Baseline for Real-Time Voice Agents
