OpenAI rebuilt its WebRTC stack to handle real-time voice AI at scale for over 900 million weekly active users. The core challenge was that the conventional one-port-per-session WebRTC model doesn't fit Kubernetes infrastructure well. Their solution splits packet routing from protocol termination using a relay plus transceiver architecture: a lightweight stateless relay layer handles UDP forwarding with a small public footprint, while stateful transceivers own all WebRTC session state (ICE, DTLS, SRTP). Routing is achieved by encoding destination metadata into the ICE username fragment (ufrag), enabling deterministic first-packet routing without hot-path lookups. The relay is written in Go using SO_REUSEPORT, thread pinning, and pre-allocated buffers for efficiency. A globally distributed relay fleet (Global Relay) reduces first-hop latency by placing ingress close to users worldwide, with Cloudflare geo-steering for signaling.
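The ufrag-routing idea can be sketched in a few lines of Go. This is an illustrative reconstruction, not OpenAI's actual wire format: the `cell.pod.nonce` layout and the `makeUfrag`/`parseUfrag` helpers are hypothetical names chosen for the example. The point is that the destination is recoverable from the very first STUN packet's USERNAME attribute, so the relay never consults a session table on the hot path.

```go
package main

import (
	"fmt"
	"strings"
)

// makeUfrag packs destination metadata into the ICE username fragment
// that the signaling layer hands to the client. The "cell.pod.nonce"
// layout here is an assumption for illustration.
func makeUfrag(cell, pod, nonce string) string {
	return strings.Join([]string{cell, pod, nonce}, ".")
}

// parseUfrag recovers the destination transceiver from a ufrag, letting
// a stateless relay forward the first packet deterministically with no
// lookup table or session state.
func parseUfrag(ufrag string) (cell, pod string, err error) {
	parts := strings.SplitN(ufrag, ".", 3)
	if len(parts) != 3 {
		return "", "", fmt.Errorf("malformed ufrag %q", ufrag)
	}
	return parts[0], parts[1], nil
}

func main() {
	u := makeUfrag("cell7", "pod42", "r9x1")
	cell, pod, err := parseUfrag(u)
	if err != nil {
		panic(err)
	}
	fmt.Printf("route %s -> cell=%s pod=%s\n", u, cell, pod)
}
```

In a real deployment the ufrag would also need to survive STUN message integrity checks and carry enough entropy to avoid collisions; those concerns are omitted here.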

14 min read · From openai.com
Table of contents

- WebRTC lets us make real-time AI products
- Choosing a media architecture
- The core deployment problem: WebRTC meets Kubernetes
- Architecture overview: relay + transceiver
- Routing on ICE credentials
- Global Relay and geo-steered signaling
- Relay implementation and performance
- Results and learnings
