Mistral's Voxtral-4B-TTS is a powerful text-to-speech model with voice cloning capabilities, but Mistral withheld the encoder weights of its audio autoencoder (Voxtral Codec), preventing users from cloning arbitrary voices. This deep-dive covers the model's architecture — including its autoregressive LLM backbone, discrete audio tokens, finite scalar quantization (FSQ), and semantic/acoustic token design — and investigates workarounds. Key findings: semantic tokens don't actually encode word meaning, and the decoder is robust to code perturbations. A gradient descent approach using straight-through estimators (STE) for both acoustic and semantic tokens, combined with multi-resolution STFT loss and speaker embedding loss, can reconstruct audio codes from target audio without the encoder — enabling approximate voice cloning. Training took ~1 hour on Apple M-series hardware for 8 seconds of audio. Code and notebooks are provided on GitHub.
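To make the straight-through estimator idea concrete, here is a minimal sketch in plain Python. It is not the article's code. The number of levels, the tanh bounding, the identity `toy_decoder`, and the squared-error loss are all assumptions chosen for illustration; the article itself uses the real frozen Voxtral decoder with multi-resolution STFT and speaker embedding losses. The sketch only shows how rounding to FSQ levels can be used in the forward pass while gradients flow as if the rounding were the identity.

```python
import math

LEVELS = 5               # hypothetical number of FSQ levels per dimension
HALF = (LEVELS - 1) / 2  # integer codes live in {-2, -1, 0, 1, 2}


def fsq_quantize(z):
    """Return (integer code, normalized value, STE gradient of value w.r.t. z)."""
    bounded = math.tanh(z) * HALF        # squash the latent into [-HALF, HALF]
    code = int(round(bounded))           # non-differentiable rounding step
    # Straight-through estimator: pretend rounding is the identity, so the
    # gradient of the normalized value is just the gradient of tanh(z).
    ste_grad = 1.0 - math.tanh(z) ** 2
    return code, code / HALF, ste_grad


def toy_decoder(values):
    """Hypothetical stand-in for the frozen audio decoder (identity map)."""
    return list(values)


# "Target audio" produced from codes we pretend not to know.
target_codes = [2, -1, 1, 0]
target_audio = toy_decoder([c / HALF for c in target_codes])

# Optimize continuous latents so the decoded quantized codes match the target.
latents = [0.0] * len(target_codes)
learning_rate = 0.1
for _ in range(200):
    quantized = [fsq_quantize(z) for z in latents]
    output = toy_decoder([value for _, value, _ in quantized])
    for i, (_, _, ste_grad) in enumerate(quantized):
        # Gradient of the squared error, passed through the rounding via STE.
        grad = 2.0 * (output[i] - target_audio[i]) * ste_grad
        latents[i] -= learning_rate * grad

recovered = [fsq_quantize(z)[0] for z in latents]
print(recovered)
```

In this toy setting the loss is exactly zero once every code matches, so the latents stop moving. With a real neural decoder and perceptual losses, the recovered codes would only approximate the original ones, which is consistent with the summary's description of the result as approximate voice cloning.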

13m read time. From towardsdatascience.com
Table of contents
Voxtral TTS overview
Do semantic tokens really represent semantics?
A gradient descent approach to reconstruct codes when the encoder is missing
AI usage disclaimer
Contacts
