TorchAO's Quantization-Aware Training (QAT) has been extended with new integrations and techniques. Key highlights: an integration with Unsloth, where INT4 QAT+LoRA recovers up to 66.9% of the accuracy degradation with a 1.73x inference speedup; an integration with Axolotl supporting NVFP4 QAT, which recovers up to 71.6% of the accuracy degradation on Gemma3-27B with a 1.35x speedup and 1/4 the HBM usage on B200 GPUs; and PARQ (Piecewise-Affine Regularized Quantization), a new optimizer-based QAT algorithm that lets 3-bit per-row models match 4-bit per-group post-training quantization (PTQ) baselines while using only ~58% of the memory and decoding ~1.57x faster. PARQ is demonstrated on Phi-4-mini-instruct, deployed with ExecuTorch on an iPhone 15 Pro. Future directions include RL integration, GPU kernel acceleration during QAT, and further framework integrations.
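
For readers new to QAT, the core idea is that the forward pass simulates low-precision rounding ("fake quantization") so the model learns weights that survive it. The sketch below illustrates that round trip for symmetric INT4 with per-group scales in plain Python. It is not TorchAO's API: the function name `fake_quantize_int4` and the default `group_size` are hypothetical, and the gradient handling a real QAT setup needs (typically a straight-through estimator) is left out.

```python
def fake_quantize_int4(values, group_size=4):
    """Quantize floats to symmetric INT4 per group, then dequantize.

    Illustrative only: names and defaults are assumptions, not a library API.
    """
    qmin, qmax = -8, 7  # signed 4-bit integer range
    out = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        # One scale per group; fall back to 1.0 for an all-zero group.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        for v in group:
            # Round to the nearest integer level and clamp to the INT4 range.
            q = max(qmin, min(qmax, round(v / scale)))
            # Map back to float so downstream math sees the rounding error.
            out.append(q * scale)
    return out


weights = [0.1, -0.5, 0.7, 0.3]
simulated = fake_quantize_int4(weights)
```

Each output value stays within half a quantization step of its input, which is the error the model is trained to tolerate.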

12m read time
From pytorch.org
Table of contents
Quantization-Aware Training
Integration with Unsloth
Integration with Axolotl
Piecewise-Affine Regularized Quantization (PARQ)
Looking Ahead
Acknowledgements
