NVIDIA introduces Nemotron Speech ASR, an open model that uses a cache-aware streaming architecture to process real-time voice interactions. Unlike traditional buffered inference systems, which repeatedly reprocess overlapping audio windows, this approach maintains an internal cache of encoder representations and processes each audio frame exactly once. The model achieves 3x higher efficiency, supports 560 concurrent streams on H100 GPUs, maintains stable latency under load, and delivers a 24ms median time-to-final transcription. Real-world validation from Daily and Modal demonstrates zero latency drift at scale, enabling natural conversational agents with sub-900ms voice-to-voice loops.
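To make the contrast between buffered and cache-aware inference concrete, here is a minimal toy sketch in Python. It is not the Nemotron Speech ASR implementation: the names `buffered_frame_visits` and `CacheAwareEncoder`, the averaging "encoder", and the window, hop, and context sizes are all assumptions chosen for illustration. It only counts how many frames each strategy has to encode for the same audio.

```python
from collections import deque


def buffered_frame_visits(num_frames: int, window: int, hop: int) -> int:
    """Count frames encoded when every new chunk triggers re-encoding of the
    trailing `window` frames (toy model of buffered inference).

    Assumes num_frames is a multiple of hop.
    """
    visits = 0
    for end in range(hop, num_frames + 1, hop):
        # Each time `hop` new frames arrive, the whole trailing window
        # (old overlap plus new frames) is encoded again.
        visits += min(window, end)
    return visits


class CacheAwareEncoder:
    """Toy streaming encoder: keeps a bounded cache of past representations
    and encodes each incoming frame exactly once (hypothetical, for illustration).
    """

    def __init__(self, context: int) -> None:
        self.cache = deque(maxlen=context)  # past encoder outputs, oldest dropped first
        self.frames_encoded = 0

    def step(self, chunk: list[float]) -> list[float]:
        outputs = []
        for frame in chunk:
            # Stand-in for a real encoder layer: mix the new frame with cached context.
            out = (frame + sum(self.cache)) / (len(self.cache) + 1)
            self.cache.append(out)
            self.frames_encoded += 1
            outputs.append(out)
        return outputs


# Same 16 frames of "audio", delivered 2 frames at a time.
frames = [float(i) for i in range(16)]
encoder = CacheAwareEncoder(context=6)
for i in range(0, len(frames), 2):
    encoder.step(frames[i:i + 2])

buffered = buffered_frame_visits(num_frames=16, window=8, hop=2)
print("buffered frame visits:", buffered)
print("cache-aware frame visits:", encoder.frames_encoded)
```

With these toy numbers the buffered loop encodes 52 frame slots while the cache-aware loop encodes 16, one per frame. The gap widens as the overlap between windows grows, which is the redundancy the cache is meant to remove.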
Table of contents
The Challenge: Why Streaming ASR Breaks at Scale
The Solution: Cache-Aware Streaming ASR for Lower Latency, Linear Scale, and Predictable Cost
Results: Throughput, Accuracy, and Speed at Scale
Real-World Validation
Conclusion: A New Baseline for Real-Time Voice Agents