A deep dive into building production-quality, real-time conversational voice AI on Android that goes well beyond the basic STT→LLM→TTS pipeline. It covers: a five-state machine (IDLE, LISTENING, THINKING, SPEAKING, ERROR) built on Kotlin StateFlow, with collectLatest for clean cancellation; streaming STT over a Deepgram WebSocket, fed by AudioRecord with the VOICE_COMMUNICATION source for hardware AEC; a fix for a session ID race condition using AtomicInteger; WebSocket pre-warming to eliminate handshake latency; sentence-level TTS streaming that reduces perceived latency to 600–800 ms; WebRTC VAD-based barge-in detection, debounced to prevent false triggers; a single shared AudioRecord instance to avoid device-specific restart failures; and backchannel filler phrases to cover slow inference. Measured full-turn latency is 1.2–1.6 s, with barge-in response under 150 ms.
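
To make the session ID idea concrete, here is a minimal sketch of how an AtomicInteger can guard against stale callbacks. The summary does not show the article's code, so `SessionGuard`, `begin`, `isCurrent`, and `onTranscript` are hypothetical names chosen for illustration; the only assumption is that each turn gets a fresh ID and that late results from an older turn should be dropped.

```kotlin
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical helper, not the article's implementation: every new turn bumps
// the counter, so callbacks holding an older ID can detect they are stale.
class SessionGuard {
    private val currentSession = AtomicInteger(0)

    /** Starts a new turn and implicitly invalidates all earlier session IDs. */
    fun begin(): Int = currentSession.incrementAndGet()

    /** True only if [sessionId] still identifies the active turn. */
    fun isCurrent(sessionId: Int): Boolean = currentSession.get() == sessionId
}

fun main() {
    val guard = SessionGuard()
    val delivered = mutableListOf<String>()

    // Stand-in for an STT or TTS callback that captured its session ID at start.
    fun onTranscript(sessionId: Int, text: String) {
        if (!guard.isCurrent(sessionId)) return // result belongs to a cancelled turn
        delivered += text
    }

    val first = guard.begin()
    val second = guard.begin() // e.g. the user barged in and a new turn began

    onTranscript(first, "stale")  // dropped: session 1 is no longer current
    onTranscript(second, "fresh") // accepted

    println(delivered) // prints [fresh]
}
```

The check is a single atomic read, so it is cheap enough to run at the top of every network or audio callback.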

12m read time. From proandroiddev.com