Microsoft released VibeVoice, an open-source text-to-speech model that can generate up to 90 minutes of conversational audio with up to four distinct speakers. The model uses continuous speech tokenizers operating at a 7.5 Hz frame rate and a next-token diffusion framework that combines LLM-based understanding with diffusion-based acoustic generation. Available