Llama.cpp is an open-source inference engine that enables running large language models locally on consumer hardware like laptops or Raspberry Pis. It achieves this through two key techniques: the GGUF file format (which bundles model weights and metadata for easy loading and swapping) and model quantization (reducing precision from 16-bit to 4-bit, cutting RAM requirements by up to 75%). It supports optimized kernels for NVIDIA CUDA, AMD ROCm, Apple Metal, Vulkan, and CPU. Developers can use it via a CLI or spin up an OpenAI-compatible local server on port 8080, making it compatible with tools like LangChain. Popular tools like Ollama, Jan, and GPT4All all use llama.cpp under the hood. Benefits include no API costs, no token limits, and full data privacy.
Sort: