NVIDIA TensorRT-LLM enhancements deliver large speedups on Llama 2 70B and enable Falcon-180B to run on a single GPU. On the H200 GPU, TensorRT-LLM achieves a 6.7x performance boost for Llama 2 70B, and it provides high inference throughput for Falcon-180B with a reduced memory footprint.
4m read time • From developer.nvidia.com
Table of contents
Llama 2 70B on H200 delivers a 6.7x performance boost
Falcon-180B performance examined
Stay on target
Ongoing work