A walkthrough of building a self-hosted AI inference platform on Kubernetes for organizations with strict data compliance requirements. Covers the rationale for self-hosting (data sovereignty, compliance, cost at scale), surveys the landscape of open-weight models, and closes with a practical demo using Crossplane to provision GPU-enabled EKS clusters and deploy models via vLLM. The setup exposes an OpenAI-compatible API endpoint, so existing tooling works without code changes. Crossplane compositions abstract away GPU operator configuration, node group setup, and vLLM wiring behind simple custom resources, letting teams deploy a model by filling in a few fields.
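To make the "fill in a few fields" idea concrete, here is a sketch of what such a Crossplane claim might look like. This is illustrative only: claim schemas are defined by each composition's XRD, so the `InferenceService` kind, API group, and every field name below are assumptions, not the demo's actual resource.

```yaml
# Hypothetical claim -- field names and API group are illustrative,
# not taken from the demo. The composition behind it would provision
# the GPU node group, install the GPU operator, and wire up vLLM.
apiVersion: platform.example.org/v1alpha1
kind: InferenceService
metadata:
  name: chat-model
spec:
  model: meta-llama/Llama-3.1-8B-Instruct  # open-weight model to serve
  gpu:
    instanceType: g5.xlarge                # GPU node instance type (example)
    count: 1
  replicas: 2                              # vLLM serving replicas
```

A platform team would publish the composition once; application teams then only author small claims like this, and each claim surfaces an OpenAI-compatible endpoint they can point existing clients at.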
21m watch time