Microsoft has released VibeVoice-ASR, a unified speech-to-text model capable of processing 60-minute audio files in a single pass using a 64K-token context window. The model simultaneously performs automatic speech recognition, speaker diarization, and timestamping to produce structured transcripts showing who spoke, when, and what was said.
