The ik_llama.cpp repository is an improved fork of llama.cpp featuring enhanced CPU matrix-multiplication implementations for both AVX2 and ARM_NEON, yielding significant performance gains in prompt processing and token generation. The fork also adds efficient inference for MoE models and Bitnet b1.58 models.
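To illustrate the kind of cache-tiled matrix multiplication such CPU back-ends build on, here is a generic sketch in plain C++. This is not ik_llama.cpp's actual kernel (which additionally uses AVX2/NEON intrinsics and quantized data formats); it only shows the blocking idea that keeps the working set of A and B resident in cache.

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative cache-tiled matmul: C = A * B, all row-major.
// A is MxK, B is KxN, C is MxN. TILE is a tunable block size
// chosen so one block of A and B fits in L1/L2 cache.
constexpr std::size_t TILE = 32;

void matmul_tiled(const float* A, const float* B, float* C,
                  std::size_t M, std::size_t K, std::size_t N) {
    for (std::size_t i = 0; i < M * N; ++i) C[i] = 0.0f;
    // Iterate over TILE x TILE blocks; bounds are clamped for edge tiles.
    for (std::size_t i0 = 0; i0 < M; i0 += TILE)
        for (std::size_t k0 = 0; k0 < K; k0 += TILE)
            for (std::size_t j0 = 0; j0 < N; j0 += TILE)
                for (std::size_t i = i0; i < std::min(i0 + TILE, M); ++i)
                    for (std::size_t k = k0; k < std::min(k0 + TILE, K); ++k) {
                        const float a = A[i * K + k];
                        // Innermost loop runs over contiguous memory in B and C,
                        // which is what the SIMD kernels vectorize.
                        for (std::size_t j = j0; j < std::min(j0 + TILE, N); ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```

Real kernels replace the innermost loop with AVX2 or NEON intrinsics and operate on quantized blocks, but the loop-ordering and tiling trade-offs are the same ones the "To tile or not to tile" section discusses.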
Table of contents
- TL;DR
- Why?
- Performance comparison to llama.cpp
- MoE models
- Bitnet-1.58B
- To tile or not to tile