Effort is a new algorithm for LLM inference that lets you adjust, in real time, how many calculations are performed during inference. It is implemented for Mistral and does not require retraining. The implementation currently supports FP16 only.

3m read time · From kolinko.github.io