The story of Gradium, the anatomy of audio AI models, and why smaller labs continue to edge out larger ones when it comes to voice.


Hacker News

Gradium, born from the open audio lab Kyutai, shows how small teams of specialized researchers are outperforming major AI labs in audio AI. A team of just 4 researchers built Moshi, the first full-duplex conversational AI model, in 6 months and reached 160ms latency. Their success stems from deep domain expertise in audio codecs, novel multi-stream architectures for turn-taking, and the fact that audio models require far less compute than text models (7B parameters vs. 405B). The article explores why audio AI has historically been underfunded, the technical innovations behind real-time voice models, including the Mimi codec, and why focused teams with genuine audio expertise can move faster than large labs dealing with multimodal compromises.

Arming the rebels with GPUs: Gradium, Kyutai, and Audio AI

A brief history of audio ML, and why it’s consistently overlooked

Dynamics of big labs and why small teams of researchers can outperform them

Audio model architectures: speech-to-speech vs. full duplex