Multi-cluster GKE Inference Gateway scales AI/ML inference across multiple Google Cloud regions.



Google Cloud has announced the preview of multi-cluster GKE Inference Gateway, which scales AI/ML inference workloads across multiple GKE clusters and Google Cloud regions. It targets the limitations of single-cluster deployments: availability risks, GPU/TPU capacity siloed in one cluster, scalability caps, and added latency for users far from the serving region. Key features include model-aware load balancing driven by custom metrics (e.g., KV cache utilization, configured via GCPBackendPolicy), fault-tolerant cross-region traffic routing, and simplified management through a single config cluster. The architecture uses Kubernetes Custom Resources, including InferencePool, InferenceObjective, GCPInferencePoolImport, and GCPBackendPolicy, to enable global low-latency serving, disaster recovery, and capacity bursting.
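
To make the routing idea concrete, here is a minimal conceptual sketch of metric-aware backend selection with cross-region spillover. It is not the gateway's actual algorithm or API: the `Backend` type, the `pick_backend` function, and the 0.8 utilization threshold are all hypothetical names and values chosen for illustration. It assumes each serving pool reports a KV cache utilization between 0 and 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Backend:
    """One model-serving pool in one cluster (hypothetical illustration type)."""
    cluster: str
    region: str
    kv_cache_utilization: float  # fraction in [0.0, 1.0], assumed to be reported by the model server
    healthy: bool = True


def pick_backend(backends: list[Backend], client_region: str,
                 max_utilization: float = 0.8) -> Optional[Backend]:
    """Choose a backend for one request (illustrative policy, not GKE's).

    Healthy backends with headroom are preferred. Among those, a backend in
    the client's region wins; otherwise traffic spills over to another region.
    Ties within a region are broken by lowest KV cache utilization.
    """
    healthy = [b for b in backends if b.healthy]
    # Keep only backends below the (assumed) saturation threshold.
    candidates = [b for b in healthy if b.kv_cache_utilization < max_utilization]
    if not candidates:
        # Everything is saturated: fall back to any healthy backend.
        candidates = healthy
    if not candidates:
        return None  # no healthy backend anywhere
    # Sort key: local region first (False sorts before True), then least loaded.
    return min(candidates,
               key=lambda b: (b.region != client_region, b.kv_cache_utilization))


if __name__ == "__main__":
    pools = [
        Backend("cluster-a", "us-central1", 0.90),
        Backend("cluster-b", "us-central1", 0.50),
        Backend("cluster-c", "europe-west4", 0.10),
    ]
    choice = pick_backend(pools, client_region="us-central1")
    print(choice.cluster if choice else "no backend available")
```

In this toy policy, a request from `us-central1` stays local while a local pool has headroom, bursts to `europe-west4` when the local pools are saturated, and fails over when they are unhealthy. That mirrors the low-latency serving, capacity bursting, and disaster recovery goals described above.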
