A step-by-step guide to accelerating large language models: deploying LLMs with tools like vLLM and quantization, measuring latency and throughput, installing and using vLLM, comparing vLLM with Hugging Face Transformers, deploying a large language model with vLLM, and benchmarking the results.
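The latency and throughput measurement the guide covers boils down to timing a generation call and dividing tokens produced by wall-clock time. A minimal sketch of that idea is below; `generate_fn` is a hypothetical stand-in for a real call such as `model.generate()` from transformers or `LLM.generate()` from vLLM, which the article benchmarks.

```python
import time

def measure_generation(generate_fn, prompt):
    """Time one generation call and report latency and throughput.

    generate_fn is a placeholder callable returning a list of tokens;
    swap in a real transformers or vLLM generation call in practice.
    """
    start = time.perf_counter()
    tokens = generate_fn(prompt)
    elapsed = time.perf_counter() - start
    return {
        "latency_s": elapsed,                           # total wall-clock time
        "tokens": len(tokens),                          # tokens generated
        "throughput_tok_per_s": len(tokens) / elapsed,  # tokens per second
    }

# Usage with a dummy generator that just repeats the prompt's words:
stats = measure_generation(lambda p: p.split() * 100, "hello world")
print(f"{stats['tokens']} tokens, {stats['throughput_tok_per_s']:.0f} tok/s")
```

Real benchmarks would average over several runs and warm up the model first, since the first call typically pays one-off compilation and cache-allocation costs.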

9 min read time, from towardsdatascience.com
Table of contents
- Step-by-step guide on how to accelerate large language models
- Deployment of Large Language Models (LLMs)
- Latency and Throughput
- Install Required Packages
- What is Phi-2?
- Benchmarking LLM Latency and Throughput with Hugging Face Transformers
- Generated Output
- How vLLM works
- vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
- Run Phi-2 with vLLM
- Generated Output
- Benchmarking Latency and Throughput in Real Time
- Google Colaboratory
- BitsandBytes
- Quantization of Mistral 7B Model
- NF4 (4-bit Normal Float) and Double Quantization
- QLoRA: Efficient Finetuning of Quantized LLMs
- Generated Output
- Google Colaboratory
- Conclusion
- vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
- Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA
- Understanding performance benchmarks for LLM inference
- What are Quantized LLMs?
