Building production-grade AI voice agents requires solving problems that demos never show: latency budgets, interruption handling, and graceful failure modes. End-to-end latency must stay under roughly 800 ms, which means streaming at every pipeline stage (STT, LLM, TTS) and keeping inference geographically close to the caller. Interruption handling needs a real voice activity detector with a short hold window to avoid false triggers. The recommended architecture separates a real-time voice pipeline from an orchestration layer that handles tool calls and business logic, using conversational filler to mask lookup latency. Tool use in voice is harder than in chat because of strict latency constraints, so pre-fetching data and using fast models for tool-call generation are critical. Common failure modes include state corruption that causes script loops, hallucination, prompt injection via STT, and misrouting. Costs run 8–20 cents per minute; speech-to-speech models, prompt caching, and concise responses help control them. The practical recommendation: start with a speech-to-speech provider, design failure paths before happy paths, and listen to real call recordings regularly.
Table of contents
Why Voice Is Different From Chat
The Latency Budget Problem
Interruption Handling, Which Is Harder Than It Sounds
The Architecture That Actually Ships
Tool Use in Voice Agents Is Harder Than in Chat
Failure Modes And How To Catch Them
Cost Considerations Worth Internalizing
What I Would Build First If I Were Starting Now
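The hold-window idea behind interruption handling can be sketched as a small debouncer over per-frame voice-activity decisions. This is a minimal illustration, not the author's implementation; the 200 ms threshold, class name, and method names are all assumptions chosen for clarity.

```python
class InterruptionDetector:
    """Confirm a barge-in only after user speech persists for a short
    hold window, so brief noise bursts don't cut the agent off.
    The default 200 ms hold is an illustrative value, not a tuned one."""

    def __init__(self, hold_ms: float = 200.0):
        self.hold_ms = hold_ms
        self._speech_started_at = None  # timestamp when current speech run began

    def on_frame(self, is_speech: bool, now_ms: float) -> bool:
        """Feed one VAD decision per audio frame (e.g. every 20 ms).
        Returns True once speech has been sustained for the hold window,
        i.e. the agent should stop speaking."""
        if not is_speech:
            # Speech run ended before the hold window elapsed: reset.
            self._speech_started_at = None
            return False
        if self._speech_started_at is None:
            self._speech_started_at = now_ms
        return (now_ms - self._speech_started_at) >= self.hold_ms

# A 100 ms noise burst does not trigger; 200 ms of sustained speech does.
det = InterruptionDetector(hold_ms=200.0)
print(det.on_frame(True, 0.0))    # speech begins
print(det.on_frame(True, 100.0))  # still inside hold window
print(det.on_frame(True, 200.0))  # hold window elapsed: barge-in confirmed
```

In a real pipeline the VAD decisions would come from an acoustic model and the confirmed barge-in would cancel the in-flight TTS stream; the hold window simply trades a little interruption latency for far fewer false triggers.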