RamaLama is a Red Hat open source project that uses containers (Podman or Docker) to run open source LLMs locally and in production environments such as Kubernetes. The talk covers running models with the llama.cpp or vLLM inference engines, benchmarking local model performance, containerizing AI workloads with security-isolation flags, building RAG pipelines that use Docling for document ingestion together with a vector database, generating systemd Quadlets and Kubernetes YAML manifests for deployment, and building agentic AI applications with LangChain4j, as sketched below. The core idea is solving the 'works on my machine' problem for AI by treating models as versioned container artifacts that move consistently from laptop to cluster.
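Because `ramalama serve` exposes an OpenAI-compatible HTTP endpoint for the model it runs, a LangChain4j agent can point its OpenAI client at the local server. Below is a minimal sketch of the agentic pattern the talk describes, assuming a model is already being served locally (e.g. via `ramalama serve <model>`); the port, model name, and the `Clock` tool are illustrative assumptions, and method names reflect the LangChain4j 0.x API, which may differ in newer releases.

```java
import dev.langchain4j.agent.tool.Tool;
import dev.langchain4j.model.openai.OpenAiChatModel;
import dev.langchain4j.service.AiServices;

public class LocalAgent {

    // Hypothetical tool; any @Tool-annotated method can be offered to the agent.
    static class Clock {
        @Tool("Returns the current time in ISO-8601 format")
        String currentTime() {
            return java.time.OffsetDateTime.now().toString();
        }
    }

    // AiServices generates an implementation of this interface at runtime.
    interface Assistant {
        String chat(String userMessage);
    }

    public static void main(String[] args) {
        // Point LangChain4j's OpenAI client at the locally served model.
        // Port 8080 and the model name "granite" are assumptions; match them
        // to your own `ramalama serve` invocation.
        OpenAiChatModel model = OpenAiChatModel.builder()
                .baseUrl("http://localhost:8080/v1")
                .apiKey("not-needed")      // local server ignores the API key
                .modelName("granite")
                .build();

        // Wire the model and the tool together into an agent.
        Assistant assistant = AiServices.builder(Assistant.class)
                .chatLanguageModel(model)
                .tools(new Clock())
                .build();

        // The model decides when to invoke the Clock tool to answer.
        System.out.println(assistant.chat("What time is it right now?"));
    }
}
```

The same agent code runs unchanged whether the endpoint is a laptop container or a cluster service, which is the portability point the talk makes.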