A deep dive into building a multi-agent software engineering system using large language models, covering technical challenges with model quantization, inference engines, and hardware optimization. The author explores the DeepSeek v2.5 MoE architecture, discusses quantization techniques such as W4A16, and shares hands-on experiences with batch inference, CPU offloading, and tensor parallelism across inference engines like vLLM, ExLlamaV2, and llama.cpp.
Table of contents
- Agents, AI, and Replit’s Next Nemesis
- What are Agents?
- Up Late, Fighting Battles No One Knows About 😅
- Wait, 192GB of VRAM Isn’t Enough?!
- LLM Architectures & Inference Engines
- Mixture of Experts Architectures
- Batch Inference and CPU Offloading
- vLLM, ExLlamaV2, Llama.cpp, and Tensor Parallelism
- Quantization, Mixed Precision, Weights and Activations
- Tensor Parallelism, Again!
- Let’s Quantize
- What’s Next?