A deep technical walkthrough of building a production-grade local LLM agent for scientific workflows on HPC hardware. Part 1 covers vLLM inference optimizations: CUDA graphs (20-25% latency reduction, 3-6x decode throughput), FP8 quantization for weights and KV cache, prefix caching (reducing TTFT from 11,470ms to 706ms on A100), and speculative decoding via Qwen3.6-27B's built-in Multi-Token Prediction head (~89% acceptance rate). Part 2 addresses long-session stability through a structured 'world state' object that persists exact analysis parameters outside the conversation history, combined with accurate token budget accounting, self-calibrating token estimates, and size-prioritized trimming. Together these changes reduce per-iteration latency from 10-15 seconds to 1-3 seconds and allow 50+ iteration sessions to complete without context overflow.

20m read timeFrom towardsdatascience.com
Post cover image
Table of contents
Part 1: Making Inference FastPart 2: Keeping Long Sessions AliveConclusion

Sort: