Cloudflare has announced new infrastructure optimizations for running large language models at scale across its global network. Key improvements include disaggregated prefill, which splits LLM request processing into compute-bound prefill and memory-bound decode stages handled by separate machines. Cloudflare also built a custom inference engine called Infire that supports pipeline and tensor parallelism to efficiently distribute models across multiple GPUs, reducing memory usage and improving throughput. Additionally, the company introduced Unweight, a compression system that reduces LLM weight sizes by 15–22% without accuracy loss. These optimizations enable running models like Llama 4 Scout on two H200 GPUs and Kimi K2.5 on eight H100 GPUs with KV-cache headroom — configurations that would struggle with standard inference engines like vLLM.
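
To make the disaggregated prefill idea concrete, below is a minimal, framework-agnostic Python sketch of the prefill/decode split. All names in it (KVCache, ToyModel, PrefillWorker, DecodeWorker) are hypothetical and are not Cloudflare's Infire API; it only shows the shape of the technique the article describes, with a toy model standing in for a transformer so the example runs end to end.

```python
# Illustrative sketch only: KVCache, PrefillWorker, DecodeWorker and ToyModel
# are hypothetical names, not Cloudflare's Infire API. The point is the shape
# of the technique: prefill builds the KV cache in one compute-heavy pass,
# decode then reads that cache repeatedly while generating tokens.
from dataclasses import dataclass, field


@dataclass
class KVCache:
    # A real engine keeps per-layer key/value tensors; a flat list is enough
    # to show the hand-off between the two stages.
    entries: list = field(default_factory=list)


class ToyModel:
    """Stand-in for a transformer so the sketch runs end to end."""
    eos_token_id = 0

    def forward_prompt(self, prompt_tokens):
        # Compute-bound stage: one forward pass over the whole prompt.
        return KVCache(entries=[t * 2 for t in prompt_tokens])

    def forward_step(self, cache):
        # Memory-bound stage: each step reads the full cache but does little
        # arithmetic, so it stresses bandwidth rather than FLOPs.
        token = (cache.entries[-1] + 1) % 10
        cache.entries.append(token)
        return token, cache


class PrefillWorker:
    """Runs on the machine sized for compute-heavy prefill."""
    def __init__(self, model):
        self.model = model

    def prefill(self, prompt_tokens):
        return self.model.forward_prompt(prompt_tokens)


class DecodeWorker:
    """Runs on the machine sized for memory-bound decoding."""
    def __init__(self, model):
        self.model = model

    def decode(self, cache, max_new_tokens=8):
        out = []
        for _ in range(max_new_tokens):
            token, cache = self.model.forward_step(cache)
            if token == self.model.eos_token_id:
                break
            out.append(token)
        return out


if __name__ == "__main__":
    # In a real disaggregated deployment the KV cache would be shipped
    # between machines; here both stages share one process for brevity.
    model = ToyModel()
    cache = PrefillWorker(model).prefill([3, 1, 4, 1, 5])
    print(DecodeWorker(model).decode(cache))  # -> [1, 2, 3, 4, 5, 6, 7, 8]
```

Because the two stages have such different resource profiles, running them on separate machines lets each be provisioned for its own bottleneck rather than sharing a single GPU's compute and memory budget.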

From infoq.com (4 min read)