NVIDIA offers tools like Perf Analyzer and Model Analyzer to help optimize ML inference performance, particularly for large language models (LLMs), by measuring metrics such as time to first token, output token throughput, and inter-token latency. The latest tool, GenAI-Perf, introduced with NVIDIA Triton, provides accurate

6m read time | From developer.nvidia.com
Table of contents
Introducing GenAI-Perf
Currently supported endpoints
Running GenAI-Perf
Conclusion
