Superhuman and Databricks jointly migrated a grammar-correction LLM workload to Databricks Model Serving, achieving peak throughput above 200K QPS with sub-second P99 latency and four-nines (99.99%) availability. Key engineering work included a custom power-of-two-choices load balancer built on an Endpoint Discovery Service, asymmetric autoscaling to absorb diurnal traffic spikes, lazy-loading container images to cut cold-start times from minutes to seconds, FP8 quantization with per-channel scaling for a 30% throughput gain, and a multiprocessing RPC server to eliminate CPU bottlenecks on fast, small models. Combined, these optimizations raised per-pod throughput on H100 GPUs from 750 to 1,200 QPS, a 60% improvement, with zero quality regression on Superhuman's evaluation harnesses.
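For readers unfamiliar with power-of-two-choices routing: the balancer samples two replicas at random and sends the request to the less loaded of the pair, which avoids both the herding behavior of always picking the globally least-loaded replica and the load variance of a single random choice. The sketch below is illustrative only; the `PowerOfTwoChoicesBalancer` class and its in-flight counters are hypothetical stand-ins for the replica state the Endpoint Discovery Service would actually supply.

```python
import random


class PowerOfTwoChoicesBalancer:
    """Minimal power-of-two-choices sketch (hypothetical API, not the
    actual Endpoint Discovery Service integration)."""

    def __init__(self, replicas):
        # Map each replica (e.g. a pod address) to its in-flight request
        # count. Assumes at least two replicas are registered.
        self.in_flight = {replica: 0 for replica in replicas}

    def pick(self):
        # Sample two distinct replicas uniformly at random, then route to
        # whichever has fewer outstanding requests. Two choices bound load
        # imbalance dramatically better than one, without the coordination
        # cost of tracking a global least-loaded ordering.
        a, b = random.sample(list(self.in_flight), 2)
        return a if self.in_flight[a] <= self.in_flight[b] else b

    def on_dispatch(self, replica):
        self.in_flight[replica] += 1

    def on_complete(self, replica):
        self.in_flight[replica] -= 1
```

A production balancer would additionally age out replicas that disappear from service discovery and make the counters safe under concurrent request handling.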
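On the quantization side, per-channel scaling means each output channel of a weight matrix gets its own scale factor, so one outlier channel does not force a coarse scale onto the entire tensor. Below is a rough numpy sketch of just the scale computation, assuming the FP8 E4M3 format (maximum representable magnitude 448); the function and variable names are illustrative, and real serving kernels would cast the scaled weights to a native FP8 dtype and run the matmul in FP8.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest magnitude representable in FP8 E4M3


def per_channel_fp8_scales(weights: np.ndarray):
    """Compute one FP8 scale per output channel (row) of a weight matrix.

    Illustrative only: shows the per-channel scaling math, not a full
    FP8 quantization kernel.
    """
    # Max absolute value per row; the epsilon guards all-zero channels.
    amax = np.maximum(np.abs(weights).max(axis=1, keepdims=True), 1e-12)
    scales = amax / FP8_E4M3_MAX
    # After dividing by its own scale, every channel fits the FP8 range.
    scaled = np.clip(weights / scales, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return scaled, scales  # dequantize later with scaled * scales
```

Per-tensor scaling would instead compute a single amax over the whole matrix; per-channel scales cost a little extra bookkeeping but preserve precision on channels with small dynamic range, which is consistent with the zero-regression result reported above.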
Table of contents
- From analytics partners to real-time inference partners
- How Superhuman modernized its serving stack
- Meeting real-time SLAs on Platform Infrastructure
- Runtime optimizations: 60% more throughput per pod
- What's next
- Key takeaways