Two different tricks for fast LLM inference
Anthropic and OpenAI recently announced fast modes for their coding models, but the two take fundamentally different approaches. Anthropic gets a 2.5x speedup (about 170 tokens/sec) by reducing batch sizes while still serving the full Opus 4.6 model, at 6x the cost. OpenAI gets a 15x speedup (1000+ tokens/sec) using specialized Cerebras chips with 44GB of on-chip memory, but only by running a smaller, less capable distilled model (GPT-5.3-Codex-Spark). The technical tradeoff: Anthropic preserves model quality for a moderate speed gain, while OpenAI sacrifices capability for a dramatic speed improvement through custom hardware.
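As a quick sanity check on those figures, here is a back-of-envelope sketch. The fast-mode throughputs and speed multipliers are the article's numbers; the implied normal-mode baselines are simply derived from them for illustration, not published figures.

```python
# Back-of-envelope check of the figures quoted above. Fast-mode throughput
# and speedup are the article's numbers; the baseline is just their ratio.

fast_modes = {
    # name: (fast-mode tokens/sec, claimed speedup vs. normal mode)
    "Anthropic Opus 4.6 fast mode": (170, 2.5),
    "OpenAI GPT-5.3-Codex-Spark": (1000, 15),
}

for name, (fast_tps, speedup) in fast_modes.items():
    baseline_tps = fast_tps / speedup  # implied normal-mode throughput
    print(f"{name}: {fast_tps} tok/s fast, ~{baseline_tps:.0f} tok/s implied baseline")
```

Interestingly, both implied baselines land in the same neighborhood (roughly 65-70 tokens/sec), so the headline difference is almost entirely about how far each company pushed the fast mode, not where they started.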
Table of contents
How Anthropic’s fast mode works
How OpenAI’s fast mode works
OpenAI’s version is much more technically impressive
Is fast AI inference the next big thing?