The post describes an end-to-end Quantization-Aware Training (QAT) process in PyTorch for large language models. It highlights how QAT can recover much of the accuracy and perplexity degradation that post-training quantization (PTQ) incurs. Users can leverage the QAT APIs in torchao to fine-tune models in torchtune. Experimental results show substantial quality improvements when QAT is applied, particularly for the Llama3 model. The post also discusses future directions such as mixed-precision quantization, hyperparameter tuning, and extending QAT to other layers and more complex data types.
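As a rough illustration of the prepare/fine-tune/convert flow the post describes, here is a minimal sketch using the torchao QAT quantizer. The import path (`torchao.quantization.prototype.qat`) and the quantizer name reflect the prototype API at the time of the post and may differ in later torchao releases; `my_llama3_model()` is a placeholder for whatever model you are fine-tuning.

```python
import torch
from torchao.quantization.prototype.qat import Int8DynActInt4WeightQATQuantizer

# Placeholder: any torch.nn.Module with linear layers,
# e.g. a Llama3 model built with torchtune
model = my_llama3_model()

# Quantizer simulating int8 dynamic per-token activations and
# int4 grouped per-channel weights for linear layers
qat_quantizer = Int8DynActInt4WeightQATQuantizer()

# Insert "fake quantize" ops that mimic quantization numerics
# during fine-tuning without actually quantizing the weights
model = qat_quantizer.prepare(model)

# ... run the usual fine-tuning loop on the prepared model ...

# After training, replace the fake quantize ops with real
# quantize ops to produce the final low-precision model
model = qat_quantizer.convert(model)
```

In torchtune, the same flow is exposed through a distributed QAT fine-tuning recipe invoked via the `tune` CLI; the exact recipe and config names depend on the torchtune version you have installed.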