A Mistral AI scientist presents the architecture behind modern text-to-speech (TTS) models, explaining why they increasingly resemble LLMs. Key topics include: framing TTS as a language modeling problem using autoregressive decoders, audio tokenization via codecs (compressing audio frames into discrete tokens), the challenge of high audio bitrates vs. sparse text information, and latency reduction through streaming audio packets. The talk also covers Mistral's newly released open-source TTS model, which uses a diffusion/flow-matching approach for token generation per frame, voice cloning capabilities (with the encoder kept proprietary), and architectural patterns for real-time text-input streaming in voice agent pipelines.
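The core idea in the talk — treating speech synthesis as next-token prediction over discrete codec tokens — can be sketched minimally. Everything below is illustrative: the names (`ToyTTSLM`, `CODEBOOK_SIZE`, `synthesize`) are hypothetical, and the "model" is a stand-in for a real transformer decoder.

```python
# Sketch: TTS framed as language modeling. Text tokens form the prompt;
# the model autoregressively emits discrete audio-codec tokens, one per
# audio frame. A codec decoder would later turn these tokens into waveform.

CODEBOOK_SIZE = 1024   # a codec quantizes each audio frame to one of these IDs
FRAME_MS = 80          # e.g. one token per 80 ms frame -> 12.5 tokens/sec

class ToyTTSLM:
    """Stand-in for an autoregressive decoder over audio tokens."""
    def next_audio_token(self, context):
        # A real model would run a transformer over `context` (text tokens
        # plus audio tokens emitted so far) and sample from a distribution
        # over the codec codebook; here we just hash deterministically.
        return (sum(context) * 31 + len(context)) % CODEBOOK_SIZE

def synthesize(text_tokens, n_frames, model):
    """Autoregressively decode n_frames of discrete audio tokens."""
    context = list(text_tokens)
    audio_tokens = []
    for _ in range(n_frames):
        tok = model.next_audio_token(context)
        audio_tokens.append(tok)
        context.append(tok)   # feed the token back, exactly as in LLM decoding
    return audio_tokens

tokens = synthesize([5, 17, 42], n_frames=4, model=ToyTTSLM())
```

Because decoding is frame-by-frame, the same loop naturally supports the streaming discussed in the talk: each new token (or small packet of tokens) can be sent to the codec decoder immediately instead of waiting for the full utterance.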

22m watch time
