Cloudflare has announced new infrastructure optimizations for running large language models at scale across its global network. Key improvements include disaggregated prefill, which splits LLM request processing into compute-bound prefill and memory-bound decode stages handled by separate machines. Cloudflare also built a custom inference engine called Infire that supports pipeline and tensor parallelism to efficiently distribute models across multiple GPUs, reducing memory usage and improving throughput. Additionally, the company introduced Unweight, a compression system that reduces LLM weight sizes by 15–22% without accuracy loss. These optimizations enable running models like Llama 4 Scout on two H200 GPUs and Kimi K2.5 on eight H100 GPUs with KV-cache headroom — configurations that would struggle with standard inference engines like vLLM.
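
To make the disaggregated prefill idea concrete, below is a minimal, framework-agnostic Python sketch of the prefill/decode split. All names in it (KVCache, ToyModel, PrefillWorker, DecodeWorker) are hypothetical and are not Cloudflare's Infire API; it only shows the shape of the technique the article describes, with a toy model standing in for a transformer so the example runs end to end.

```python
# Illustrative sketch only: KVCache, PrefillWorker, DecodeWorker and ToyModel
# are hypothetical names, not Cloudflare's Infire API. The point is the shape
# of the technique: prefill builds the KV cache in one compute-heavy pass,
# decode then reads that cache repeatedly while generating tokens.
from dataclasses import dataclass, field


@dataclass
class KVCache:
    # A real engine keeps per-layer key/value tensors; a flat list is enough
    # to show the hand-off between the two stages.
    entries: list = field(default_factory=list)


class ToyModel:
    """Stand-in for a transformer so the sketch runs end to end."""
    eos_token_id = 0

    def forward_prompt(self, prompt_tokens):
        # Compute-bound stage: one forward pass over the whole prompt.
        return KVCache(entries=[t * 2 for t in prompt_tokens])

    def forward_step(self, cache):
        # Memory-bound stage: each step reads the full cache but does little
        # arithmetic, so it stresses bandwidth rather than FLOPs.
        token = (cache.entries[-1] + 1) % 10
        cache.entries.append(token)
        return token, cache


class PrefillWorker:
    """Runs on the machine sized for compute-heavy prefill."""
    def __init__(self, model):
        self.model = model

    def prefill(self, prompt_tokens):
        return self.model.forward_prompt(prompt_tokens)


class DecodeWorker:
    """Runs on the machine sized for memory-bound decoding."""
    def __init__(self, model):
        self.model = model

    def decode(self, cache, max_new_tokens=8):
        out = []
        for _ in range(max_new_tokens):
            token, cache = self.model.forward_step(cache)
            if token == self.model.eos_token_id:
                break
            out.append(token)
        return out


if __name__ == "__main__":
    # In a real disaggregated deployment the KV cache would be shipped
    # between machines; here both stages share one process for brevity.
    model = ToyModel()
    cache = PrefillWorker(model).prefill([3, 1, 4, 1, 5])
    print(DecodeWorker(model).decode(cache))  # -> [1, 2, 3, 4, 5, 6, 7, 8]
```

Because the two stages have such different resource profiles, running them on separate machines lets each be provisioned for its own bottleneck rather than sharing a single GPU's compute and memory budget.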

From infoq.com (4 min read)