AutoRound is an Intel-developed quantization toolkit for LLMs and Vision-Language Models that achieves high accuracy at 2–4 bit widths by using signed-gradient descent to tune weight rounding. It supports multiple quantization schemes (W2A16 through W8A16, MXFP4, NVFP4, and GGUF formats), integrates with vLLM, SGLang, and Hugging Face Transformers, and runs on CPU, CUDA, Intel XPU, and Gaudi hardware. Key features include quantizing 7B models in ~10 minutes on a single GPU, mixed-precision AutoScheme generation, support for 10+ VLMs, and export to AutoGPTQ/AutoAWQ/GGUF formats. Recent updates include block-wise FP8 quantization, MTP layer support, and integration into LLM-Compressor.
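The core signed-gradient rounding idea can be illustrated with a small NumPy sketch. This is a toy illustration, not AutoRound's actual implementation: the function names, the per-tensor scale, and the straight-through gradient estimate are all assumptions made for the sketch. The essence is that each weight gets a learnable rounding offset `v` in [-0.5, 0.5], updated by the sign of an estimated gradient of the layer-output error.

```python
import numpy as np

def fake_quantize(w, scale, v, bits=4):
    """Round w/scale plus a learnable offset v, clamped to the signed integer grid."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.floor(w / scale + 0.5 + v), -qmax - 1, qmax)
    return q * scale

def sign_gd_round(w, x, bits=4, steps=200, lr=0.01):
    """Learn per-weight rounding offsets v by signed-gradient descent,
    minimizing the layer-output error ||x @ w_q - x @ w||^2 on calibration data x.
    (Toy sketch: AutoRound itself also tunes clipping ranges, works block-wise, etc.)"""
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)  # simple per-tensor scale
    v = np.zeros_like(w)
    for _ in range(steps):
        w_q = fake_quantize(w, scale, v, bits)
        err = x @ w_q - x @ w                 # error in output space, not weight space
        grad = (x.T @ err) * scale            # straight-through estimate of dL/dv
        v = np.clip(v - lr * np.sign(grad), -0.5, 0.5)  # signed update, bounded offsets
    return fake_quantize(w, scale, v, bits), v

# Tiny demo on random data (hypothetical shapes, for illustration only)
rng = np.random.default_rng(0)
w = rng.normal(size=(16, 8))   # weight matrix of one linear layer
x = rng.normal(size=(32, 16))  # calibration activations
w_q, v = sign_gd_round(w, x)
rtn = fake_quantize(w, np.max(np.abs(w)) / 7, np.zeros_like(w))  # round-to-nearest baseline
err_learned = np.linalg.norm(x @ w_q - x @ w)
err_rtn = np.linalg.norm(x @ rtn - x @ w)
```

Because only the sign of the gradient is used, the update is robust to gradient magnitude and the offsets stay within a half-step of round-to-nearest, which is what lets the method recover accuracy at very low bit widths.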

From github.com
Table of contents
- 🚀 What is AutoRound?
- 🆕 What's New
- ✨ Key Features
- Installation
- Model Quantization (CPU/Intel GPU/Gaudi/CUDA)
- Model Inference
- Publications & Events
- Acknowledgement
- 🌟 Support Us
