AutoRound is an Intel-developed quantization toolkit for LLMs and Vision-Language Models that achieves high accuracy at 2–4 bit widths by using signed-gradient descent to tune weight rounding. It supports multiple quantization schemes (W2A16 through W8A16, MXFP4, NVFP4, and GGUF formats), integrates with vLLM, SGLang, and Hugging Face Transformers, and runs on CPU, CUDA, Intel XPU, and Gaudi hardware. Key features include quantizing 7B models in ~10 minutes on a single GPU, mixed-precision AutoScheme generation, support for 10+ VLMs, and export to AutoGPTQ/AutoAWQ/GGUF formats. Recent updates include block-wise FP8 quantization, MTP layer support, and integration into LLM-Compressor.
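The core signed-gradient rounding idea can be illustrated with a small NumPy sketch. This is a toy illustration, not AutoRound's actual implementation: the function names, the per-tensor scale, and the straight-through gradient estimate are all assumptions made for the sketch. The essence is that each weight gets a learnable rounding offset `v` in [-0.5, 0.5], updated by the sign of an estimated gradient of the layer-output error.

```python
import numpy as np

def fake_quantize(w, scale, v, bits=4):
    """Round w/scale plus a learnable offset v, clamped to the signed integer grid."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.floor(w / scale + 0.5 + v), -qmax - 1, qmax)
    return q * scale

def sign_gd_round(w, x, bits=4, steps=200, lr=0.01):
    """Learn per-weight rounding offsets v by signed-gradient descent,
    minimizing the layer-output error ||x @ w_q - x @ w||^2 on calibration data x.
    (Toy sketch: AutoRound itself also tunes clipping ranges, works block-wise, etc.)"""
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)  # simple per-tensor scale
    v = np.zeros_like(w)
    for _ in range(steps):
        w_q = fake_quantize(w, scale, v, bits)
        err = x @ w_q - x @ w                 # error in output space, not weight space
        grad = (x.T @ err) * scale            # straight-through estimate of dL/dv
        v = np.clip(v - lr * np.sign(grad), -0.5, 0.5)  # signed update, bounded offsets
    return fake_quantize(w, scale, v, bits), v

# Tiny demo on random data (hypothetical shapes, for illustration only)
rng = np.random.default_rng(0)
w = rng.normal(size=(16, 8))   # weight matrix of one linear layer
x = rng.normal(size=(32, 16))  # calibration activations
w_q, v = sign_gd_round(w, x)
rtn = fake_quantize(w, np.max(np.abs(w)) / 7, np.zeros_like(w))  # round-to-nearest baseline
err_learned = np.linalg.norm(x @ w_q - x @ w)
err_rtn = np.linalg.norm(x @ rtn - x @ w)
```

Because only the sign of the gradient is used, the update is robust to gradient magnitude and the offsets stay within a half-step of round-to-nearest, which is what lets the method recover accuracy at very low bit widths.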

From github.com
Table of contents
- 🚀 What is AutoRound?
- 🆕 What's New
- ✨ Key Features
- Installation
- Model Quantization (CPU/Intel GPU/Gaudi/CUDA)
- Model Inference
- Publications & Events
- Acknowledgement
- 🌟 Support Us
