A deep dive into building a multi-agent software engineering system using large language models, covering technical challenges with model quantization, inference engines, and hardware optimization. The author explores the DeepSeek v2.5 MoE architecture, discusses quantization techniques such as W4A16, and shares hands-on experiences with batch inference, CPU offloading, and tensor parallelism across inference engines like vLLM, ExLlamaV2, and llama.cpp.
Table of contents
- Agents, AI, and Replit’s Next Nemesis
- What are Agents?
- Up Late, Fighting Battles No One Knows About 😅
- Wait, 192GB of VRAM Isn’t Enough?!
- LLM Architectures & Inference Engines
- Mixture of Experts Architectures
- Batch Inference and CPU Offloading
- vLLM, ExLlamaV2, Llama.cpp, and Tensor Parallelism
- Quantization, Mixed Precision, Weights and Activations
- Tensor Parallelism, Again!
- Let’s Quantize
- What’s Next?