Researchers at UCSD integrated DFlash, a block-diffusion speculative decoding framework, into the vLLM TPU inference ecosystem running on Google TPU v5p. Unlike autoregressive speculative decoding, which requires O(K) sequential draft passes, DFlash generates an entire block of candidate tokens in a single O(1) forward pass. The implementation required solving three key engineering challenges: a dual-cache architecture to reconcile paged attention with non-causal block diffusion, power-of-2 padding for efficient CPU-TPU context buffer transfers, and state synchronization to prevent sequence length inflation. Benchmarks show an average 3.13x speedup over the baseline, with math tasks reaching nearly 6x and a 2.29x advantage over EAGLE-3. A key hardware discovery, dubbed 'K-Flat', reveals that on TPU v5p verifying 1024 tokens costs nearly the same as verifying 16, shifting the research focus from block size to draft quality. The full implementation has been open-sourced in the vLLM tpu-inference repository.
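To make the O(K)-versus-O(1) contrast concrete, here is a minimal sketch; `draft_step` and `draft_block` are toy stand-ins for a small autoregressive drafter and a block-diffusion drafter, not DFlash's actual API:

```python
import jax.numpy as jnp

K = 8  # speculative tokens proposed per verification step

def draft_step(prefix):
    """Toy autoregressive drafter: one token per forward pass."""
    return prefix[-1] + 1                    # placeholder for a real model call

def draft_block(prefix, k):
    """Toy block drafter: k candidate tokens from a single forward pass."""
    return prefix[-1] + 1 + jnp.arange(k)    # placeholder for a real model call

prefix = jnp.array([1, 2, 3])

# Autoregressive drafting: K dependent passes, each waiting on the last.
tokens = prefix
for _ in range(K):
    tokens = jnp.concatenate([tokens, draft_step(tokens)[None]])
ar_draft = tokens[-K:]

# Block-diffusion drafting: the whole candidate block from one call.
bd_draft = draft_block(prefix, K)

assert jnp.array_equal(ar_draft, bd_draft)   # same candidates, K fewer passes
```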
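The power-of-2 padding challenge also lends itself to a short illustration. The sketch below (illustrative names, not the vLLM tpu-inference code) buckets host-side context buffers to the next power-of-2 length, so the CPU-TPU transfer path sees only O(log max_len) distinct shapes and XLA can reuse compiled programs instead of recompiling for every context length:

```python
import numpy as np
import jax

def pad_to_pow2(context: np.ndarray, pad_id: int = 0) -> np.ndarray:
    """Pad a 1-D token buffer to the smallest power-of-2 length >= its size."""
    n = context.shape[0]
    padded_len = 1 << max(n - 1, 0).bit_length()
    out = np.full(padded_len, pad_id, dtype=context.dtype)
    out[:n] = context
    return out

ctx = np.arange(1000, dtype=np.int32)   # real context length: 1000
padded = pad_to_pow2(ctx)               # transferred length: 1024
buf = jax.device_put(padded)            # host -> TPU copy reuses a bucketed shape
```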
Table of contents
- Overcoming autoregressive bottlenecks
- Diffusion-style drafting on Google TPUs
- Bringing DFlash to TPU/JAX
- Benchmarking the future of TPU serving