RamaLama is a tool that simplifies running AI models locally inside containers using Podman, Docker, or Kubernetes. It supports multiple model registries including Ollama, HuggingFace, and OCI registries, and automatically handles GPU acceleration. The guide walks through installing RamaLama and Podman on macOS, running models like tinyllama and Gemma, and integrating them with a Spring Boot application via Spring AI's OpenAI-compatible module. It also covers deploying AI model containers on Minikube with GPU support using the generic-device-plugin and krunkit driver for Apple Silicon GPU acceleration.
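As a quick orientation before the full walkthrough, the core workflow can be sketched with a few RamaLama commands. This is a minimal sketch: the `tinyllama` model name and `ollama://` registry prefix follow the article's examples, and exact flags or default ports may differ by RamaLama version.

```shell
# Pull a small model from the Ollama registry (model name from the article)
ramalama pull ollama://tinyllama

# Chat with the model interactively; RamaLama runs it inside a container
# via Podman or Docker and picks GPU acceleration when available
ramalama run ollama://tinyllama

# Or expose the model as an OpenAI-compatible REST endpoint, which is
# what a Spring AI client can point at
ramalama serve ollama://tinyllama
```

The `serve` mode is what makes the Spring Boot integration possible: Spring AI's OpenAI module only needs a base URL pointing at the locally served model.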

From piotrminkowski.com (10 min read)
Table of contents
- Source Code
- Install RamaLama
- Install and Configure Podman
- Run Model with Ramalama
- Integrate Spring AI with Models on RamaLama
- Use RamaLama to Run Containers with AI Models in Kubernetes
- Conclusion
