Pocket TTS is a 100M-parameter text-to-speech model with voice cloning capabilities that runs in real time on CPUs. Unlike larger LLM-based TTS models requiring GPUs or smaller specialized models with fixed voices, it bridges the gap by using continuous audio latents instead of discrete tokens. The model achieves the lowest

9m read time From kyutai.org
Table of contents
Kyutai Pocket TTS, Evaluation, Architecture, Data, Scientific contributions, Authors
