ggml-org/llama.cpp: LLM inference in C/C++
The ggml-org/llama.cpp project is a plain C/C++ implementation of inference for Meta's LLaMA models and a variety of other supported models. It aims for minimal setup and high performance across a wide range of hardware. It is optimized for Apple silicon, uses AVX-family instructions on x86 architectures, and ships custom CUDA kernels for NVIDIA GPUs. The project also supports integer quantization of models at bit widths from 1.5 to 8 bits, which speeds up inference and reduces memory usage. In addition, there are bindings for multiple programming languages and plugins for popular code editors.
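
As a rough illustration of why quantization reduces memory usage, the sketch below stores each block of 32 float weights as 32 signed 8-bit integers plus one shared float scale. This is a generic block-wise scheme written for this summary, not llama.cpp's actual storage format or API: the names `QuantBlock`, `quantize_q8`, `dequantize_q8`, and the block size `kBlockSize` are all hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical block size for this sketch (an assumption, not a project constant).
constexpr std::size_t kBlockSize = 32;

// One quantized block: a shared scale plus one 8-bit integer per weight.
struct QuantBlock {
    float scale;
    std::int8_t q[kBlockSize];
};

// Quantize `x` block by block. Assumes x.size() is a multiple of kBlockSize;
// any trailing partial block is ignored in this sketch.
std::vector<QuantBlock> quantize_q8(const std::vector<float>& x) {
    std::vector<QuantBlock> out(x.size() / kBlockSize);
    for (std::size_t b = 0; b < out.size(); ++b) {
        const float* src = x.data() + b * kBlockSize;

        // The largest magnitude in the block determines the scale.
        float amax = 0.0f;
        for (std::size_t i = 0; i < kBlockSize; ++i) {
            amax = std::fmax(amax, std::fabs(src[i]));
        }

        const float scale = amax / 127.0f;
        const float inv = scale > 0.0f ? 1.0f / scale : 0.0f;  // all-zero block
        out[b].scale = scale;

        // Map each weight to the nearest integer in [-127, 127].
        for (std::size_t i = 0; i < kBlockSize; ++i) {
            out[b].q[i] = static_cast<std::int8_t>(std::lround(src[i] * inv));
        }
    }
    return out;
}

// Recover approximate float weights from the quantized blocks.
std::vector<float> dequantize_q8(const std::vector<QuantBlock>& blocks) {
    std::vector<float> out(blocks.size() * kBlockSize);
    for (std::size_t b = 0; b < blocks.size(); ++b) {
        for (std::size_t i = 0; i < kBlockSize; ++i) {
            out[b * kBlockSize + i] =
                blocks[b].scale * static_cast<float>(blocks[b].q[i]);
        }
    }
    return out;
}
```

In this sketch a block takes far fewer bytes than the 32 floats it replaces, at the cost of a rounding error of at most about half the block's scale per weight. Lower bit widths trade more accuracy for a further reduction in size.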