A Guide to Voice Cloning on Voxtral with a Missing Encoder

Mistral's Voxtral-4B-TTS is a powerful text-to-speech model with voice cloning capabilities, but Mistral withheld the encoder weights of its audio autoencoder (Voxtral Codec), preventing users from cloning arbitrary voices. This deep-dive covers the model's architecture — including its autoregressive LLM backbone, discrete audio tokens, finite scalar quantization (FSQ), and semantic/acoustic token design — and investigates workarounds. Key findings: semantic tokens don't actually encode word meaning, and the decoder is robust to code perturbations. A gradient descent approach using straight-through estimators (STE) for both acoustic and semantic tokens, combined with multi-resolution STFT loss and speaker embedding loss, can reconstruct audio codes from target audio without the encoder — enabling approximate voice cloning. Training took ~1 hour on Apple M-series hardware for 8 seconds of audio. Code and notebooks are provided on GitHub.
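As a rough illustration of the idea summarized above, here is a minimal pure-Python sketch of finite scalar quantization and of recovering codes by gradient descent with a straight-through estimator. Everything in it is an assumption made for illustration: the function names `fsq_quantize` and `recover_codes`, the `tanh` bounding, the 5 levels per dimension, and above all the identity "decoder" with a squared-error loss. The article's actual method optimizes through the real Voxtral decoder with multi-resolution STFT and speaker embedding losses, which this toy does not reproduce.

```python
import math


def fsq_quantize(z, levels=5):
    """Snap each coordinate to one of `levels` evenly spaced values in [-1, 1].

    Generic FSQ-style quantizer (assumed form, not Voxtral's exact one).
    """
    half = (levels - 1) / 2
    # tanh bounds the latent to (-1, 1); rounding picks the nearest grid point.
    return [round(math.tanh(v) * half) / half for v in z]


def recover_codes(target, levels=5, steps=200, lr=0.5):
    """Search for latents whose quantized values match `target`.

    Toy setup: the "decoder" is the identity, so the loss is
    sum((q - target) ** 2) on the quantized values themselves.
    """
    z = [0.0] * len(target)
    for _ in range(steps):
        q = fsq_quantize(z, levels)
        # Gradient of the squared error with respect to the quantized values.
        grad_q = [2 * (qi - ti) for qi, ti in zip(q, target)]
        # Straight-through estimator: treat round() as the identity in the
        # backward pass, keeping only the derivative of tanh.
        z = [
            zi - lr * g * (1 - math.tanh(zi) ** 2)
            for zi, g in zip(z, grad_q)
        ]
    return fsq_quantize(z, levels)


if __name__ == "__main__":
    print(recover_codes([0.5, -1.0, 0.0, 1.0]))
```

The rounding step has zero gradient almost everywhere, which is why the backward pass pretends it is the identity; without that substitution the latents would never move.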

#text-to-speech

#gradient-descent

#voice-cloning

Apr 10 • 13 min read • From towardsdatascience.com

Table of contents

Voxtral TTS overview
Do semantic tokens really represent semantics?
A gradient descent approach to reconstruct codes when the encoder is missing
AI usage disclaimer
Contacts
