Enterprises adopt open models to control cost, latency, and IP. This session shows how Java teams select, integrate, and operate LLMs, using frameworks such as LangChain4j together with vector search, and run them both locally and in the cloud. It covers benchmarking, model size vs. throughput, memory footprints on the JVM, response-time tuning, and safety layers. It also highlights where the latest GenAI Java projects complement inference pipelines and how to evaluate RAG quality with reproducible metrics. Attendees see end-to-end flows, from data grounding to deployment, with attention to observability, configuration, and rollback strategies.
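To make "reproducible metrics" for RAG concrete, here is a minimal sketch, not taken from the talk or its repository, of two common retrieval metrics: hit rate@k (the share of queries with at least one relevant document in the top k results) and mean reciprocal rank (MRR). The class name `RagEval`, the query and document IDs, and the choice of these two metrics are illustrative assumptions. Because the labels and retriever output are fixed data, every run yields the same scores.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Illustrative retrieval-quality metrics for a RAG pipeline (hypothetical helper).
 * Inputs: for each query ID, the set of relevant ("gold") document IDs and the
 * ranked list of document IDs the retriever returned.
 */
public class RagEval {

    /** Fraction of queries with at least one relevant document in the top k results. */
    public static double hitRateAtK(Map<String, List<String>> retrieved,
                                    Map<String, Set<String>> relevant, int k) {
        int hits = 0;
        for (var entry : relevant.entrySet()) {
            List<String> ranked = retrieved.getOrDefault(entry.getKey(), List.of());
            List<String> topK = ranked.subList(0, Math.min(k, ranked.size()));
            if (topK.stream().anyMatch(entry.getValue()::contains)) {
                hits++;
            }
        }
        return relevant.isEmpty() ? 0.0 : (double) hits / relevant.size();
    }

    /** Mean of 1/rank of the first relevant document per query (0 when none is retrieved). */
    public static double meanReciprocalRank(Map<String, List<String>> retrieved,
                                            Map<String, Set<String>> relevant) {
        double sum = 0.0;
        for (var entry : relevant.entrySet()) {
            List<String> ranked = retrieved.getOrDefault(entry.getKey(), List.of());
            for (int i = 0; i < ranked.size(); i++) {
                if (entry.getValue().contains(ranked.get(i))) {
                    sum += 1.0 / (i + 1);
                    break;
                }
            }
        }
        return relevant.isEmpty() ? 0.0 : sum / relevant.size();
    }

    public static void main(String[] args) {
        // Hypothetical hand-labeled evaluation set: query ID -> relevant doc IDs.
        Map<String, Set<String>> relevant = Map.of(
            "q1", Set.of("doc-3"),
            "q2", Set.of("doc-7", "doc-8"),
            "q3", Set.of("doc-1"));
        // Hypothetical retriever output: query ID -> ranked doc IDs.
        Map<String, List<String>> retrieved = Map.of(
            "q1", List.of("doc-3", "doc-5", "doc-2"),
            "q2", List.of("doc-4", "doc-8", "doc-7"),
            "q3", List.of("doc-9", "doc-6", "doc-2"));
        System.out.printf("hit@1 = %.3f%n", hitRateAtK(retrieved, relevant, 1));
        System.out.printf("hit@3 = %.3f%n", hitRateAtK(retrieved, relevant, 3));
        System.out.printf("MRR   = %.3f%n", meanReciprocalRank(retrieved, relevant));
    }
}
```

Keeping the labeled query set under version control and re-running it after each change to chunking, embedding model, or top-k settings gives a simple regression signal for retrieval quality.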

Presented by Brian Benz (Microsoft) at JavaOne 2026 (CA, March 2026).

All JavaOne 2026 talks ➤ https://www.youtube.com/playlist?list=PLX8CzqL3ArzUMVSzm-z_-if8BIB55EGl4

➤ https://github.com/bbenz/gen-ai-with-open-models

Tags: #Java #JavaOne #AI

LRN1140 PGM 17 104

Java (Official Oracle)

A conference talk covering how Java teams can integrate open and open-weight LLMs into production applications. Topics include why enterprises choose open models (cost, latency, IP control), key terminology (quantization, GGUF, mixture of experts, TTFT), model selection strategies, and live demos using LangChain4j with Ollama. The talk walks through building a RAG pipeline in Java with embedding models and vector search, implementing tool calling to avoid unnecessary LLM queries, adding safety guardrails, and deployment options including Azure Container Apps, AKS, Microsoft AI Foundry, GitHub Models, Hugging Face, NVIDIA build platform, and Docker Model Runner.
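The vector-search step of a RAG pipeline comes down to ranking stored text chunks by similarity to the query's embedding. The sketch below assumes cosine similarity and a brute-force scan in plain Java, with no LangChain4j or Ollama calls. The class `TinyVectorSearch` and its 3-dimensional vectors are hypothetical stand-ins for real embedding-model output, which typically has hundreds of dimensions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/**
 * Brute-force cosine-similarity search over an in-memory list of embeddings
 * (hypothetical illustration of what a vector store does at its core).
 */
public class TinyVectorSearch {

    /** A stored text chunk and its embedding vector. */
    public record Chunk(String text, float[] embedding) {}

    /** A search hit: the chunk plus its cosine similarity to the query. */
    public record Match(Chunk chunk, double score) {}

    /** Cosine similarity: dot(a, b) / (|a| * |b|). */
    public static double cosine(float[] a, float[] b) {
        if (a.length != b.length) throw new IllegalArgumentException("dimension mismatch");
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0.0; // avoid dividing by zero for zero vectors
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Scores every chunk against the query and returns the k best, highest score first. */
    public static List<Match> topK(List<Chunk> store, float[] query, int k) {
        List<Match> scored = new ArrayList<>();
        for (Chunk c : store) {
            scored.add(new Match(c, cosine(c.embedding(), query)));
        }
        scored.sort(Comparator.comparingDouble(Match::score).reversed());
        return scored.subList(0, Math.min(k, scored.size()));
    }

    public static void main(String[] args) {
        // Hypothetical chunks with made-up embeddings.
        List<Chunk> store = List.of(
            new Chunk("Ollama serves models on localhost", new float[] {0.9f, 0.1f, 0.0f}),
            new Chunk("GGUF is a binary file format for model weights", new float[] {0.1f, 0.9f, 0.1f}),
            new Chunk("TTFT measures time to first token", new float[] {0.0f, 0.2f, 0.9f}));
        float[] query = {0.2f, 0.8f, 0.0f}; // hypothetical embedding of "what is GGUF?"
        for (Match m : topK(store, query, 2)) {
            System.out.printf("%.3f  %s%n", m.score(), m.chunk().text());
        }
    }
}
```

The retrieved chunks would then be inserted into the prompt as grounding context. Production vector stores usually replace the linear scan with an approximate index such as HNSW, trading a little recall for much faster lookups on large collections.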