NVIDIA TensorRT provides an 8-bit post-training quantization (PTQ) toolkit that speeds up diffusion-model deployment on NVIDIA hardware while preserving image quality. Its INT8 and FP8 quantization recipes achieve significant speedups for diffusion models on NVIDIA RTX 6000 Ada GPUs. SmoothQuant is a popular PTQ method for diffusion models, but it has limitations; to address them, TensorRT adds a fine-grained tuning pipeline on top of SmoothQuant. The overall workflow for accelerating a diffusion model with TensorRT 8-bit quantization is: calibrate the model, export it to ONNX, and build the TensorRT engine.
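To make the SmoothQuant step concrete, here is a minimal NumPy sketch of the core idea (illustrative only, not TensorRT's implementation): per-input-channel scales migrate activation outliers into the weights so the linear layer's output is mathematically unchanged, while the smoothed activations become easier to quantize. The `alpha` parameter controls migration strength and is the kind of knob a fine-grained, per-layer tuning pipeline would adjust; all names here are our own.

```python
import numpy as np

def smoothquant_scales(X, W, alpha=0.5):
    """Per-input-channel smoothing scales: s_j = max|X_j|^alpha / max|W_j|^(1-alpha)."""
    act_max = np.abs(X).max(axis=0)      # per-channel activation range
    wgt_max = np.abs(W).max(axis=1)      # per-input-channel weight range
    s = act_max**alpha / wgt_max**(1 - alpha)
    return np.clip(s, 1e-5, None)        # guard against all-zero channels

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 8))
X[:, 3] *= 50.0                          # inject an outlier activation channel
W = rng.normal(size=(8, 4))

s = smoothquant_scales(X, W, alpha=0.5)
X_s = X / s                              # smoothed activations (smaller dynamic range)
W_s = W * s[:, None]                     # compensated weights

# The layer's output is unchanged: (X / s) @ (diag(s) @ W) == X @ W
assert np.allclose(X @ W, X_s @ W_s)
```

Because the scaling cancels exactly, this transformation can be applied offline before quantization without changing the model's FP32 behavior.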
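The calibration step deserves a sketch as well. One common PTQ calibration choice is to pick the clip range from a high percentile of observed activation magnitudes rather than the absolute maximum, so a few outliers do not blow up the quantization step size. The NumPy example below is illustrative (function names are ours, not TensorRT's calibrator API) and compares max-based and percentile-based calibration for symmetric per-tensor INT8 quantization:

```python
import numpy as np

def calibrate_amax(samples, percentile=99.9):
    """Percentile calibration: choose the clip range ignoring extreme outliers."""
    return np.percentile(np.abs(samples), percentile)

def int8_roundtrip(x, amax):
    """Symmetric per-tensor INT8 quantize + dequantize."""
    scale = amax / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

rng = np.random.default_rng(0)
acts = rng.normal(size=10_000).astype(np.float32)
acts[:5] = 80.0                                      # a few extreme outliers

deq_max = int8_roundtrip(acts, np.abs(acts).max())   # max calibration
deq_pct = int8_roundtrip(acts, calibrate_amax(acts)) # percentile calibration

bulk = slice(5, None)                                # error on the well-behaved values
mse_max = np.mean((acts[bulk] - deq_max[bulk]) ** 2)
mse_pct = np.mean((acts[bulk] - deq_pct[bulk]) ** 2)
```

Here `mse_pct` comes out far smaller than `mse_max` on the bulk of the data: percentile calibration clips the rare outliers but preserves resolution where most activations actually live, which is the trade-off PTQ calibration tunes.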

From developer.nvidia.com
Table of contents
- Benchmarking
- TensorRT Solution: overcoming inference speed challenges
- Using TensorRT 8-bit quantization to accelerate diffusion models
- Conclusion
