The Infrastructure Behind Making Local LLM Agents Actually Useful

A deep technical walkthrough of building a production-grade local LLM agent for scientific workflows on HPC hardware. Part 1 covers vLLM inference optimizations: CUDA graphs (20-25% latency reduction, 3-6x decode throughput), FP8 quantization for weights and KV cache, prefix caching (reducing time to first token, TTFT, from 11,470 ms to 706 ms on an A100), and speculative decoding via Qwen3.6-27B's built-in Multi-Token Prediction head (~89% acceptance rate). Part 2 addresses long-session stability through a structured "world state" object that persists exact analysis parameters outside the conversation history, combined with accurate token budget accounting, self-calibrating token estimates, and size-prioritized trimming. Together, these changes reduce per-iteration latency from 10-15 seconds to 1-3 seconds and allow sessions of 50+ iterations to complete without context overflow.
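The Part 2 ideas can be illustrated with a minimal sketch. It is not the article's implementation: the names `WorldState`, `TokenEstimator`, and `trim_history`, the characters-per-token heuristic, and the smoothing factor are all assumptions made for illustration. The sketch keeps exact parameters in a separate object, corrects its token estimate against a real token count, and drops the largest older messages first when the budget is exceeded.

```python
import json
from dataclasses import dataclass, field


@dataclass
class WorldState:
    """Hypothetical container for exact analysis parameters,
    kept outside the conversation history."""
    params: dict = field(default_factory=dict)

    def render(self) -> str:
        # Deterministic serialization so the state can be re-injected each iteration.
        return "WORLD STATE: " + json.dumps(self.params, sort_keys=True)


class TokenEstimator:
    """Assumed self-calibrating estimate: starts from a guessed
    chars-per-token ratio and nudges it toward observed counts."""

    def __init__(self, chars_per_token: float = 4.0, smoothing: float = 0.5):
        self.chars_per_token = chars_per_token
        self.smoothing = smoothing

    def estimate(self, text: str) -> int:
        return max(1, round(len(text) / self.chars_per_token))

    def calibrate(self, text: str, actual_tokens: int) -> None:
        # actual_tokens would come from the server's reported prompt token count.
        if actual_tokens <= 0 or not text:
            return
        observed = len(text) / actual_tokens
        self.chars_per_token += self.smoothing * (observed - self.chars_per_token)


def trim_history(messages, state, estimator, budget):
    """Size-prioritized trimming: drop the largest older message first
    until the estimate fits. The world state and latest message are kept."""
    kept = list(messages)

    def total() -> int:
        return estimator.estimate(state.render()) + sum(
            estimator.estimate(m) for m in kept
        )

    while total() > budget and len(kept) > 1:
        # Only messages before the most recent one are candidates for removal.
        largest = max(range(len(kept) - 1), key=lambda i: len(kept[i]))
        del kept[largest]
    return [state.render()] + kept


state = WorldState({"threshold": 0.05, "dataset": "run_042"})
estimator = TokenEstimator()
history = ["short question", "x" * 4000, "latest tool result"]
prompt = trim_history(history, state, estimator, budget=200)
```

In this example the 4,000-character message is removed while the parameters in the world state remain in the prompt verbatim, which is the property the summary attributes to the approach.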

#ai-agents

#vllm

#context-engineering

Yesterday • 20m read time • From towardsdatascience.com

Table of contents

Part 1: Making Inference Fast
Part 2: Keeping Long Sessions Alive
Conclusion
