Low-Rank Adaptation (LoRA) makes customizing large language models (LLMs) easier and more efficient: it is a fine-tuning method that reduces both training time and memory requirements. LoRA introduces low-rank matrices into the LLM architecture and trains only these matrices while keeping the original LLM weights frozen.
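The core idea can be sketched in a few lines of NumPy. This is an illustrative sketch only, not TensorRT-LLM code: the function name `lora_forward`, the rank `r`, and the scaling factor `alpha` are assumptions chosen for the example. The frozen weight `W` is augmented with a trainable low-rank product `B @ A`, so only `r * (d_in + d_out)` parameters are trained instead of `d_in * d_out`.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha, r):
    """Forward pass of a linear layer with a LoRA update.

    W : frozen base weight, shape (d_out, d_in) -- never trained
    A : trainable down-projection, shape (r, d_in)
    B : trainable up-projection, shape (d_out, r), zero-initialized
    The effective weight is W + (alpha / r) * B @ A.
    """
    # Base output plus the low-rank correction, scaled by alpha / r.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 16, 8, 4, 8.0
x = rng.standard_normal((2, d_in))
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in)) * 0.01  # small random init
B = np.zeros((d_out, r))                   # zero init: adapter starts as a no-op

y = lora_forward(x, W, A, B, alpha, r)
# With B = 0, the output is identical to the frozen layer's output.
assert np.allclose(y, x @ W.T)
```

Because `B` starts at zero, fine-tuning begins from the base model's behavior exactly, and the adapter (`A`, `B`) can later be merged into `W` or kept separate for multi-LoRA serving.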

12 min read · From developer.nvidia.com
Table of contents
- Tutorial prerequisites
- What is LoRA?
- The math behind LoRA
- Multi-LoRA deployment
- LoRA tuning
- LoRA inference
- Set up and build TensorRT-LLM
- Retrieve model weights
- Compile the model
- Run the model
- Deploying LoRA tuned models with Triton and inflight batching
- Conclusion
