Microsoft released VibeVoice, an open-source text-to-speech model that can generate up to 90 minutes of conversational audio with up to four distinct speakers. The model uses continuous speech tokenizers operating at 7.5 Hz and a next-token diffusion framework that combines LLM-based understanding with diffusion-based acoustic generation. It is available in 1.5B and 7B parameter versions on Hugging Face, supports cross-lingual synthesis, and can spontaneously generate background music. The model is intended for research purposes, and the release includes installation instructions, demo examples, and usage guidelines.
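To give a sense of scale for the 7.5 Hz figure, here is a rough back-of-the-envelope sketch of how many tokenizer frames 90 minutes of audio would occupy. The function name `acoustic_frames` and the 50 Hz comparison rate are illustrative assumptions, not taken from the VibeVoice release, and the calculation ignores text tokens and any other overhead.

```python
def acoustic_frames(minutes: float, frame_rate_hz: float) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


# Figures from the summary: 90 minutes at 7.5 frames per second.
vibevoice_frames = acoustic_frames(90, 7.5)

# Hypothetical 50 Hz tokenizer, included only for contrast.
comparison_frames = acoustic_frames(90, 50.0)

print(vibevoice_frames)   # 40500
print(comparison_frames)  # 270000
```

Under these assumptions, the low frame rate keeps a 90-minute session to roughly 40,000 acoustic frames, which is plausibly why long-form generation fits in a single sequence.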