Superhuman and Databricks jointly migrated a grammar-correction LLM workload to Databricks Model Serving, achieving peak throughput above 200K QPS with sub-second P99 latency and four-nines (99.99%) availability. Key engineering work included a custom power-of-two-choices load balancer built on an Endpoint Discovery Service, asymmetric autoscaling to absorb diurnal traffic spikes, lazy-loading container images to cut cold-start times from minutes to seconds, FP8 quantization with per-channel scaling for a 30% throughput gain, and a multiprocessing RPC server to eliminate CPU bottlenecks on fast, small models. Combined, these optimizations raised per-pod throughput on H100 GPUs from 750 to 1,200 QPS, a 60% improvement, with zero quality regression on Superhuman's evaluation harnesses.
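For readers unfamiliar with power-of-two-choices routing: the balancer samples two replicas at random and sends the request to the less loaded of the pair, which avoids both the herding behavior of always picking the globally least-loaded replica and the load variance of a single random choice. The sketch below is illustrative only; the `PowerOfTwoChoicesBalancer` class and its in-flight counters are hypothetical stand-ins for the replica state the Endpoint Discovery Service would actually supply.

```python
import random


class PowerOfTwoChoicesBalancer:
    """Minimal power-of-two-choices sketch (hypothetical API, not the
    actual Endpoint Discovery Service integration)."""

    def __init__(self, replicas):
        # Map each replica (e.g. a pod address) to its in-flight request
        # count. Assumes at least two replicas are registered.
        self.in_flight = {replica: 0 for replica in replicas}

    def pick(self):
        # Sample two distinct replicas uniformly at random, then route to
        # whichever has fewer outstanding requests. Two choices bound load
        # imbalance dramatically better than one, without the coordination
        # cost of tracking a global least-loaded ordering.
        a, b = random.sample(list(self.in_flight), 2)
        return a if self.in_flight[a] <= self.in_flight[b] else b

    def on_dispatch(self, replica):
        self.in_flight[replica] += 1

    def on_complete(self, replica):
        self.in_flight[replica] -= 1
```

A production balancer would additionally age out replicas that disappear from service discovery and make the counters safe under concurrent request handling.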
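On the quantization side, per-channel scaling means each output channel of a weight matrix gets its own scale factor, so one outlier channel does not force a coarse scale onto the entire tensor. Below is a rough numpy sketch of just the scale computation, assuming the FP8 E4M3 format (maximum representable magnitude 448); the function and variable names are illustrative, and real serving kernels would cast the scaled weights to a native FP8 dtype and run the matmul in FP8.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest magnitude representable in FP8 E4M3


def per_channel_fp8_scales(weights: np.ndarray):
    """Compute one FP8 scale per output channel (row) of a weight matrix.

    Illustrative only: shows the per-channel scaling math, not a full
    FP8 quantization kernel.
    """
    # Max absolute value per row; the epsilon guards all-zero channels.
    amax = np.maximum(np.abs(weights).max(axis=1, keepdims=True), 1e-12)
    scales = amax / FP8_E4M3_MAX
    # After dividing by its own scale, every channel fits the FP8 range.
    scaled = np.clip(weights / scales, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return scaled, scales  # dequantize later with scaled * scales
```

Per-tensor scaling would instead compute a single amax over the whole matrix; per-channel scales cost a little extra bookkeeping but preserve precision on channels with small dynamic range, which is consistent with the zero-regression result reported above.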
Table of contents
- From analytics partners to real-time inference partners
- How Superhuman modernized its serving stack
- Meeting real-time SLAs on Platform Infrastructure
- Runtime optimizations: 60% more throughput per pod
- What's next
- Key takeaways