Building Voice Agents with ExecuTorch: A Cross-Platform Foundation for On-Device Audio

ExecuTorch, PyTorch's native inference platform, now supports on-device voice workloads across CPU, GPU, and NPU on Linux, macOS, Windows, Android, and iOS. Reference implementations are provided for five voice models: Voxtral Realtime (streaming transcription, ~4B params), Parakeet TDT (offline transcription, 0.6B params), Sortformer (speaker diarization, 117M params), Whisper (offline transcription), and Silero VAD (voice activity detection). Each model is exported by calling torch.export() directly on the original PyTorch model code with minimal edits, and model inference is kept separate from the C++ orchestration logic. Quantization (int4/int8) is applied before export. LM Studio already ships voice transcription powered by ExecuTorch, using Parakeet TDT on macOS and Windows. Sample Android apps and a macOS desktop transcription demo are available in the executorch-examples repository.

#pytorch

#speech-recognition

Mar 14 • 8m read time • From pytorch.org

Table of contents

TL;DR
Voice on the Edge Today
Design Principles
Voice Models in Practice
Sample Applications
Adoption
Case Study in production: LM Studio
Get Involved
Acknowledgement
