In this blog post, the MosaicML engineering team shares best practices for running popular open source large language models (LLMs) in production.

databricks

LLMs generate text in two stages: prefill, which processes the input prompt, and decoding, which produces output tokens one at a time. Key metrics for LLM serving include time to first token, time per output token, latency, and throughput. Techniques for optimizing LLM inference include operator fusion, quantization, compression, and parallelization. Batch size affects both latency and throughput, so it has to be chosen with the serving workload in mind.
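To make the relationship between these metrics concrete, here is a minimal sketch in Python. It assumes a common approximation that the summary above does not state explicitly: total latency is the time to first token plus the time per output token multiplied by the number of generated tokens, and throughput is the number of output tokens produced per second across all requests in a batch. The function names and the example numbers are hypothetical and chosen only for illustration.

```python
def total_latency(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Approximate end-to-end latency for one request, in seconds.

    Assumption: latency = TTFT + TPOT * (number of output tokens).
    """
    return ttft_s + tpot_s * output_tokens


def throughput(batch_size: int, output_tokens: int, latency_s: float) -> float:
    """Output tokens per second across a batch of concurrent requests.

    Assumption: every request in the batch generates the same number of
    tokens and finishes after `latency_s` seconds.
    """
    return batch_size * output_tokens / latency_s


# Hypothetical numbers: 0.5 s to first token, 50 ms per output token,
# 100 output tokens, and 8 requests served concurrently.
latency = total_latency(ttft_s=0.5, tpot_s=0.05, output_tokens=100)
tokens_per_second = throughput(batch_size=8, output_tokens=100, latency_s=latency)
print(f"latency: {latency:.2f} s, throughput: {tokens_per_second:.1f} tokens/s")
```

In this simplified model, raising the batch size increases throughput only as long as the time per output token does not grow proportionally, which is why batch size shows up as a trade-off between the two metrics.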

LLM Inference Performance Engineering: Best Practices