Microsoft Releases VibeVoice-ASR: A Unified Speech-to-Text Model Designed to Handle 60-Minute Long-Form Audio in a Single Pass
Microsoft has released VibeVoice-ASR, a unified speech-to-text model that can process 60-minute audio files in a single pass using a 64K-token context window. The model simultaneously performs automatic speech recognition, speaker diarization, and timestamping to produce structured transcripts showing who spoke, when, and