Two different tricks for fast LLM inference


Anthropic and OpenAI recently announced fast modes for their coding models, but they use fundamentally different approaches. Anthropic achieves a 2.5x speedup (170 tokens/sec) by serving the full Opus 4.6 model with reduced batch sizes, at 6x the cost. OpenAI achieves a 15x speedup (1000+ tokens/sec) using specialized Cerebras chips with 44GB of on-chip memory, but this requires a smaller, less capable distilled model (GPT-5.3-Codex-Spark). The technical tradeoff: Anthropic maintains model quality for moderate speed gains, while OpenAI sacrifices capability for dramatic speed improvements through custom hardware.
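The batch-size tradeoff can be sketched with a back-of-envelope model. Assume decoding is memory-bandwidth bound: each decode step streams the model weights once (shared across the batch) plus each request's KV cache, so larger batches raise total throughput but slow down each individual user. All numbers below are illustrative assumptions, not vendor figures.

```python
def per_user_tps(bw_gb_s: float, weights_gb: float, kv_gb_per_req: float, batch: int) -> float:
    """Tokens/sec seen by one user, assuming bandwidth-bound decoding.

    Each step must stream the full weights once (amortized over the batch)
    plus every request's KV cache; each request emits one token per step.
    """
    step_time = (weights_gb + batch * kv_gb_per_req) / bw_gb_s
    return 1.0 / step_time


def aggregate_tps(bw_gb_s: float, weights_gb: float, kv_gb_per_req: float, batch: int) -> float:
    """Total tokens/sec across the whole batch (what the provider bills against)."""
    return batch * per_user_tps(bw_gb_s, weights_gb, kv_gb_per_req, batch)


# Illustrative setup: 8 TB/s effective bandwidth, 500 GB of weights, 2 GB KV per request.
small, large = 8, 64
fast_user = per_user_tps(8000, 500, 2, small)    # small batch: faster per user
slow_user = per_user_tps(8000, 500, 2, large)    # large batch: slower per user
assert fast_user > slow_user
assert aggregate_tps(8000, 500, 2, large) > aggregate_tps(8000, 500, 2, small)
```

Shrinking the batch makes each user faster but leaves bandwidth (and therefore revenue-generating throughput) on the table, which is consistent with the 6x price on Anthropic's fast mode. OpenAI's Cerebras route instead attacks `bw_gb_s` directly with on-chip SRAM, at the cost of fitting only a smaller model into the 44GB.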

9m read time · From seangoedecke.com
Table of contents

- How Anthropic's fast mode works
- How OpenAI's fast mode works
- OpenAI's version is much more technically impressive
- Is fast AI inference the next big thing?
