Low-Rank Adaptation (LoRA) makes customizing large language models (LLMs) easier and more efficient: it is a fine-tuning method that reduces training time and memory requirements. LoRA introduces low-rank matrices into the LLM architecture and trains only these matrices while keeping the original LLM weights frozen.
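The idea can be sketched in a few lines of NumPy. This is an illustrative toy, not the TensorRT-LLM API: the dimensions, rank `r`, and scaling factor `alpha` are hypothetical, and a real LoRA layer would train `A` and `B` with gradient descent while `W` stays frozen.

```python
import numpy as np

d_out, d_in, r = 16, 32, 4   # hypothetical layer sizes and LoRA rank
alpha = 8                    # hypothetical LoRA scaling factor

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable, rank r
B = np.zeros((d_out, r))                   # trainable, initialized to zero

x = rng.standard_normal(d_in)

# Forward pass: base output plus the scaled low-rank update B @ A.
y = W @ x + (alpha / r) * (B @ (A @ x))

# With B initialized to zero, the adapted layer starts out identical
# to the frozen base layer.
assert np.allclose(y, W @ x)
```

Because only `A` and `B` are trained, the number of trainable parameters is `r * (d_in + d_out)` instead of `d_in * d_out`, which is where the time and memory savings come from.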
Table of contents
Tutorial prerequisites
What is LoRA?
The math behind LoRA
Multi-LoRA deployment
LoRA tuning
LoRA inference
Set up and build TensorRT-LLM
Retrieve model weights
Compile the model
Run the model
Deploying LoRA tuned models with Triton and inflight batching
Conclusion