Two different tricks for fast LLM inference


Anthropic and OpenAI recently announced fast modes for their coding models, but they use fundamentally different approaches. Anthropic achieves a 2.5x speedup (170 tokens/sec) by serving the full Opus 4.6 model with reduced batch sizes, at 6x the cost. OpenAI achieves a 15x speedup (1000+ tokens/sec) using specialized Cerebras chips with 44GB of on-chip memory, but this requires a smaller, less capable distilled model (GPT-5.3-Codex-Spark). The technical tradeoff: Anthropic maintains model quality for moderate speed gains, while OpenAI sacrifices capability for dramatic speed improvements through custom hardware.
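The batch-size tradeoff can be sketched with a back-of-envelope model. Assume decoding is memory-bandwidth bound: each decode step streams the model weights once (shared across the batch) plus each request's KV cache, so larger batches raise total throughput but slow down each individual user. All numbers below are illustrative assumptions, not vendor figures.

```python
def per_user_tps(bw_gb_s: float, weights_gb: float, kv_gb_per_req: float, batch: int) -> float:
    """Tokens/sec seen by one user, assuming bandwidth-bound decoding.

    Each step must stream the full weights once (amortized over the batch)
    plus every request's KV cache; each request emits one token per step.
    """
    step_time = (weights_gb + batch * kv_gb_per_req) / bw_gb_s
    return 1.0 / step_time


def aggregate_tps(bw_gb_s: float, weights_gb: float, kv_gb_per_req: float, batch: int) -> float:
    """Total tokens/sec across the whole batch (what the provider bills against)."""
    return batch * per_user_tps(bw_gb_s, weights_gb, kv_gb_per_req, batch)


# Illustrative setup: 8 TB/s effective bandwidth, 500 GB of weights, 2 GB KV per request.
small, large = 8, 64
fast_user = per_user_tps(8000, 500, 2, small)    # small batch: faster per user
slow_user = per_user_tps(8000, 500, 2, large)    # large batch: slower per user
assert fast_user > slow_user
assert aggregate_tps(8000, 500, 2, large) > aggregate_tps(8000, 500, 2, small)
```

Shrinking the batch makes each user faster but leaves bandwidth (and therefore revenue-generating throughput) on the table, which is consistent with the 6x price on Anthropic's fast mode. OpenAI's Cerebras route instead attacks `bw_gb_s` directly with on-chip SRAM, at the cost of fitting only a smaller model into the 44GB.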

9m read time · From seangoedecke.com
Table of contents

- How Anthropic's fast mode works
- How OpenAI's fast mode works
- OpenAI's version is much more technically impressive
- Is fast AI inference the next big thing?
