A Mistral AI scientist presents the architecture behind modern text-to-speech (TTS) models, explaining why they increasingly resemble LLMs. Key topics include: framing TTS as a language modeling problem using autoregressive decoders, audio tokenization via codecs (compressing audio frames into discrete tokens), the challenge of high audio bitrates vs. sparse text information, and latency reduction through streaming audio packets. The talk also covers Mistral's newly released open-source TTS model, which uses a diffusion/flow-matching approach for token generation per frame, voice cloning capabilities (with the encoder kept proprietary), and architectural patterns for real-time text-input streaming in voice agent pipelines.
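The core idea in the talk — treating speech synthesis as next-token prediction over discrete codec tokens — can be sketched minimally. Everything below is illustrative: the names (`ToyTTSLM`, `CODEBOOK_SIZE`, `synthesize`) are hypothetical, and the "model" is a stand-in for a real transformer decoder.

```python
# Sketch: TTS framed as language modeling. Text tokens form the prompt;
# the model autoregressively emits discrete audio-codec tokens, one per
# audio frame. A codec decoder would later turn these tokens into waveform.

CODEBOOK_SIZE = 1024   # a codec quantizes each audio frame to one of these IDs
FRAME_MS = 80          # e.g. one token per 80 ms frame -> 12.5 tokens/sec

class ToyTTSLM:
    """Stand-in for an autoregressive decoder over audio tokens."""
    def next_audio_token(self, context):
        # A real model would run a transformer over `context` (text tokens
        # plus audio tokens emitted so far) and sample from a distribution
        # over the codec codebook; here we just hash deterministically.
        return (sum(context) * 31 + len(context)) % CODEBOOK_SIZE

def synthesize(text_tokens, n_frames, model):
    """Autoregressively decode n_frames of discrete audio tokens."""
    context = list(text_tokens)
    audio_tokens = []
    for _ in range(n_frames):
        tok = model.next_audio_token(context)
        audio_tokens.append(tok)
        context.append(tok)   # feed the token back, exactly as in LLM decoding
    return audio_tokens

tokens = synthesize([5, 17, 42], n_frames=4, model=ToyTTSLM())
```

Because decoding is frame-by-frame, the same loop naturally supports the streaming discussed in the talk: each new token (or small packet of tokens) can be sent to the codec decoder immediately instead of waiting for the full utterance.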

22m watch time
