Two different tricks for fast LLM inference


Anthropic and OpenAI recently announced fast modes for their coding models, but they use fundamentally different approaches. Anthropic achieves 2.5x speed (170 tokens/sec) by reducing batch sizes while serving the full Opus 4.6 model, at 6x the cost. OpenAI achieves 15x speed (1000+ tokens/sec) using specialized Cerebras chips.
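To see why shrinking the batch size speeds up each user's stream but raises cost, here is a toy model of memory-bound decoding. All numbers are illustrative assumptions, not Anthropic's or OpenAI's actual figures: each decode step streams the model weights once (a fixed cost) plus some per-sequence KV-cache reads, and produces one token for every request in the batch.

```python
# Toy model of memory-bound LLM decoding.
# WEIGHT_READ_MS and PER_SEQ_KV_MS are made-up illustrative constants.
WEIGHT_READ_MS = 5.0   # fixed cost per step: stream the weights once
PER_SEQ_KV_MS = 0.25   # extra cost per sequence in the batch (KV cache)

def step_time_ms(batch_size: int) -> float:
    # One step emits one token for every sequence in the batch.
    return WEIGHT_READ_MS + PER_SEQ_KV_MS * batch_size

def per_user_tokens_per_sec(batch_size: int) -> float:
    # Each user gets one token per step, so a shorter step = faster stream.
    return 1000.0 / step_time_ms(batch_size)

def aggregate_tokens_per_sec(batch_size: int) -> float:
    # Total tokens the GPU produces per second, across all users.
    return batch_size * per_user_tokens_per_sec(batch_size)

for b in (1, 8, 64):
    print(f"batch={b:3d}  per-user={per_user_tokens_per_sec(b):6.1f} tok/s"
          f"  aggregate={aggregate_tokens_per_sec(b):7.1f} tok/s")
```

In this sketch, batch 1 gives each user roughly four times the tokens/sec of batch 64, but the GPU's aggregate output drops by an order of magnitude, which is exactly why serving the same model in small batches costs a large multiple per token.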

From seangoedecke.com (9 minute read)
Table of contents

- How Anthropic’s fast mode works
- How OpenAI’s fast mode works
- OpenAI’s version is much more technically impressive
- Is fast AI inference the next big thing?
