A conference talk covering how Java teams can integrate open and open-weight LLMs into production applications. Topics include why enterprises choose open models (cost, latency, IP control), key terminology (quantization, GGUF, mixture of experts, time to first token or TTFT), model selection strategies, and live demos using LangChain4j with Ollama. The talk walks through building a RAG pipeline in Java with embedding models and vector search, implementing tool calling to avoid unnecessary LLM queries, and adding safety guardrails. It then surveys deployment options, including Azure Container Apps, AKS, Microsoft AI Foundry, GitHub Models, Hugging Face, the NVIDIA build platform, and Docker Model Runner.
Watch time: 50 minutes
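The retrieval step of a RAG pipeline comes down to vector search: embed the question, find the stored chunks whose embeddings are most similar, and place those chunks in the prompt. Below is a minimal, dependency-free Java sketch of that idea. It assumes brute-force cosine similarity over toy 3-dimensional vectors, which stand in for real embeddings from an embedding model. The class and method names (`MiniVectorSearch`, `topK`, `buildPrompt`) are hypothetical and are not LangChain4j's API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/**
 * Minimal in-memory vector search for a RAG-style retrieval step.
 * Hypothetical sketch: the toy 3-dimensional vectors below stand in for
 * embeddings that a real embedding model would produce.
 */
public class MiniVectorSearch {

    /** A text chunk paired with its (hypothetical) embedding vector. */
    record Chunk(String text, double[] embedding) {}

    /** A retrieved chunk together with its similarity score. */
    record Match(Chunk chunk, double score) {}

    /** Cosine similarity: dot(a, b) / (|a| * |b|); returns 0 for zero vectors. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Brute-force top-k search: score every chunk, sort descending, keep k. */
    static List<Match> topK(List<Chunk> store, double[] query, int k) {
        List<Match> matches = new ArrayList<>();
        for (Chunk c : store) {
            matches.add(new Match(c, cosine(c.embedding(), query)));
        }
        matches.sort(Comparator.comparingDouble(Match::score).reversed());
        return matches.subList(0, Math.min(k, matches.size()));
    }

    /** Put the retrieved chunks into a prompt for the chat model. */
    static String buildPrompt(String question, List<Match> context) {
        StringBuilder sb = new StringBuilder("Answer using only this context:\n");
        for (Match m : context) {
            sb.append("- ").append(m.chunk().text()).append('\n');
        }
        return sb.append("Question: ").append(question).toString();
    }

    public static void main(String[] args) {
        List<Chunk> store = List.of(
            new Chunk("GGUF is a file format for quantized models.", new double[] {0.9, 0.1, 0.0}),
            new Chunk("TTFT measures time to first token.", new double[] {0.1, 0.9, 0.1}),
            new Chunk("Ollama serves models locally.", new double[] {0.2, 0.2, 0.9})
        );
        double[] query = {0.8, 0.2, 0.1}; // hypothetical embedding of the question
        List<Match> hits = topK(store, query, 1);
        System.out.println(buildPrompt("What is GGUF?", hits));
    }
}
```

Running `main` prints a prompt that contains only the GGUF chunk, because that chunk's vector is closest to the query. Brute-force scoring is linear in the number of chunks. A production pipeline would more likely use a vector database or an approximate nearest-neighbor index, and LangChain4j provides its own embedding-store abstractions for this purpose.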