This post shows how to build a conversational voice AI bot with sub-second response latency using Modal's serverless platform, the Pipecat framework, and open-source models. The implementation achieves ~1 second voice-to-voice latency by combining the Parakeet STT model, the Qwen3 LLM served with vLLM, and the Kokoro TTS model. Key optimizations include using Modal Tunnels to bypass the input plane, WebRTC for client connections, regional pinning to minimize network latency, and independent autoscaling of the GPU services. The demo is a RAG-powered assistant for Modal's documentation, with structured outputs and animated avatars.
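The ~1 second figure is best thought of as a budget spread across the pipeline stages. A minimal sketch of how such a budget adds up — the per-stage numbers below are illustrative assumptions for this kind of STT → LLM → TTS pipeline, not measurements from this post:

```python
# Illustrative voice-to-voice latency budget (all numbers are assumptions,
# not measured values). The streaming pipeline overlaps stages, so what
# matters is time-to-first-output at each stage, not total generation time.
budget_ms = {
    "vad_and_turn_detection": 200,   # confirming the user stopped speaking
    "stt_parakeet": 150,             # speech-to-text transcription
    "llm_ttft_qwen3_vllm": 300,      # time to first token from the LLM
    "tts_kokoro_first_audio": 150,   # time to first synthesized audio chunk
    "network_webrtc": 100,           # client <-> server transport overhead
}

total_ms = sum(budget_ms.values())
print(f"Estimated voice-to-voice latency: {total_ms} ms")
```

Framing latency this way makes the optimizations in the post legible: regional pinning and Modal Tunnels attack the network line item, while streaming STT/LLM/TTS services shrink each stage's time-to-first-output.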
Table of contents
- Conversational Voice AI Applications
- Why Modal and Pipecat work so well together for Voice AI
- Voice-to-Voice Latency
- Building a Conversational Voice AI for Modal’s Docs
- Testing Performance
- Deploy Your Own Conversational Voice AI on Modal
- Bonus: Animating Modal’s Mascots Moe and Dal