Researchers at UCSD integrated DFlash, a block-diffusion speculative decoding framework, into the vLLM TPU inference ecosystem running on Google TPU v5p. Unlike autoregressive speculative decoding, which requires O(K) sequential draft passes, DFlash generates an entire block of candidate tokens in a single O(1) forward pass. The implementation required solving three key engineering challenges: a dual-cache architecture to reconcile paged attention with non-causal block diffusion, power-of-2 padding for efficient CPU-TPU context buffer transfers, and state synchronization to prevent sequence length inflation. Benchmarks show an average 3.13x speedup over the baseline, with math tasks reaching nearly 6x and a 2.29x advantage over EAGLE-3. A key hardware discovery, dubbed 'K-Flat', reveals that on TPU v5p verifying 1024 tokens costs nearly the same as verifying 16, shifting the research focus from block size to draft quality. The full implementation has been open-sourced in the vLLM tpu-inference repository.
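To make the O(K)-versus-O(1) contrast concrete, here is a minimal sketch; `draft_step` and `draft_block` are toy stand-ins for a small autoregressive drafter and a block-diffusion drafter, not DFlash's actual API:

```python
import jax.numpy as jnp

K = 8  # speculative tokens proposed per verification step

def draft_step(prefix):
    """Toy autoregressive drafter: one token per forward pass."""
    return prefix[-1] + 1                    # placeholder for a real model call

def draft_block(prefix, k):
    """Toy block drafter: k candidate tokens from a single forward pass."""
    return prefix[-1] + 1 + jnp.arange(k)    # placeholder for a real model call

prefix = jnp.array([1, 2, 3])

# Autoregressive drafting: K dependent passes, each waiting on the last.
tokens = prefix
for _ in range(K):
    tokens = jnp.concatenate([tokens, draft_step(tokens)[None]])
ar_draft = tokens[-K:]

# Block-diffusion drafting: the whole candidate block from one call.
bd_draft = draft_block(prefix, K)

assert jnp.array_equal(ar_draft, bd_draft)   # same candidates, K fewer passes
```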
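The power-of-2 padding challenge also lends itself to a short illustration. The sketch below (illustrative names, not the vLLM tpu-inference code) buckets host-side context buffers to the next power-of-2 length, so the CPU-TPU transfer path sees only O(log max_len) distinct shapes and XLA can reuse compiled programs instead of recompiling for every context length:

```python
import numpy as np
import jax

def pad_to_pow2(context: np.ndarray, pad_id: int = 0) -> np.ndarray:
    """Pad a 1-D token buffer to the smallest power-of-2 length >= its size."""
    n = context.shape[0]
    padded_len = 1 << max(n - 1, 0).bit_length()
    out = np.full(padded_len, pad_id, dtype=context.dtype)
    out[:n] = context
    return out

ctx = np.arange(1000, dtype=np.int32)   # real context length: 1000
padded = pad_to_pow2(ctx)               # transferred length: 1024
buf = jax.device_put(padded)            # host -> TPU copy reuses a bucketed shape
```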
Table of contents
- Overcoming autoregressive bottlenecks
- Diffusion-style drafting on Google TPUs
- Bringing DFlash to TPU/JAX
- Benchmarking the future of TPU serving