TorchAO's Quantization-Aware Training (QAT) has been extended with new integrations and techniques. Key highlights include: an integration with Unsloth that recovers up to 66.9% of the accuracy degradation caused by quantization using INT4 QAT+LoRA, with a 1.73x inference speedup; an integration with Axolotl supporting NVFP4 QAT that recovers up to 71.6% of the accuracy degradation on Gemma3-27B, with a 1.35x speedup and 1/4 the HBM usage on B200 GPUs; and PARQ, a new optimizer-based QAT algorithm that lets 3-bit per-row models match 4-bit per-group PTQ baselines while using only ~58% of the memory and decoding with ~1.57x higher throughput. PARQ is demonstrated on Phi-4-mini-instruct, deployed with ExecuTorch on an iPhone 15 Pro. Future directions include RL integration, GPU kernel acceleration during QAT, and further framework integrations.
Table of contents
Quantization-Aware Training
Integration with Unsloth
Integration with Axolotl
Piecewise-Affine Regularized Quantization (PARQ)
Looking Ahead
Acknowledgements