
Microsoft has released VibeVoice-ASR, a unified speech-to-text model that can process 60-minute audio files in a single pass using a 64K-token context window. The model performs automatic speech recognition, speaker diarization, and timestamping at the same time, producing structured transcripts that show who spoke, when, and what was said. It supports customized hotwords for domain-specific terminology without retraining, and it targets meeting and conversational scenarios. The model is evaluated with metrics such as diarization error rate (DER), concatenated minimum-permutation word error rate (cpWER), and time-constrained minimum-permutation word error rate (tcpWER), and it can be integrated into meeting assistants, analytics tools, and transcription pipelines.
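To make the "who spoke, when, and what was said" structure concrete, here is a minimal Python sketch of how a downstream tool might represent and summarize such a transcript. The `Segment` class, its field names, the helper functions, and the sample data are all hypothetical and chosen for illustration; they are not VibeVoice-ASR's actual output format or API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One transcript entry: who spoke, when, and what was said.

    Hypothetical schema for illustration, not the model's real output format.
    """
    speaker: str  # speaker label, e.g. "Speaker 1"
    start: float  # segment start time, in seconds
    end: float    # segment end time, in seconds
    text: str     # recognized words for this segment


def talk_time(segments):
    """Return total speaking time per speaker, in seconds."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)


def format_transcript(segments):
    """Render segments as '[MM:SS-MM:SS] speaker: text' lines, in time order."""
    def mmss(seconds):
        minutes, secs = divmod(int(seconds), 60)
        return f"{minutes:02d}:{secs:02d}"

    ordered = sorted(segments, key=lambda seg: seg.start)
    return "\n".join(
        f"[{mmss(seg.start)}-{mmss(seg.end)}] {seg.speaker}: {seg.text}"
        for seg in ordered
    )


# Made-up example data, not real model output.
segments = [
    Segment("Speaker 1", 0.0, 4.5, "Let's review the quarterly numbers."),
    Segment("Speaker 2", 4.5, 9.0, "Revenue is up over last quarter."),
    Segment("Speaker 1", 9.0, 11.0, "Good."),
]

print(format_transcript(segments))
print(talk_time(segments))
```

Keeping speaker, timing, and text together in one record is what lets a meeting assistant or analytics tool answer questions such as per-speaker talk time without aligning separate ASR and diarization outputs after the fact.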

Microsoft Releases VibeVoice-ASR: A Unified Speech-to-Text Model Designed to Handle 60-Minute Long-Form Audio in a Single Pass
