The Mixtral 8x7B 32k model is a Mixture of Experts (MoE) model with 995 tensors, including the token embedding, output norm, and output tensors. The model has 32 transformer blocks, each containing an attention layer and an FFN layer. During inference, only two of the eight experts are used per token, so generation runs at roughly the speed of a ~13B dense model. The total parameter count is about 47B because each block's FFN is replicated as eight independent expert FFNs, while the attention weights are shared. If there is not enough VRAM on the GPU, the model (or part of it) can be run on the CPU instead.
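To make the "two experts per token" idea concrete, here is a minimal PyTorch sketch of top-2 MoE routing. This is a simplification for illustration, not Mixtral's actual implementation (Mixtral's experts use a gated SwiGLU FFN, and the class and parameter names here are made up); it shows why all eight expert FFNs contribute to the stored ~47B parameters while only two of them are evaluated for any given token.

```python
# Minimal sketch of top-2 Mixture-of-Experts routing (illustrative only).
# Dimensions follow Mixtral's hidden size (4096) and FFN size (14336),
# but the expert FFN itself is simplified to Linear -> SiLU -> Linear.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    def __init__(self, d_model=4096, d_ff=14336, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router: scores each token against every expert.
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        # All eight experts are stored (this is where the ~47B total comes from),
        # but only top_k of them run per token (the "active" parameters).
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.SiLU(),
                nn.Linear(d_ff, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x):
        # x: (n_tokens, d_model)
        scores = self.gate(x)                           # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # pick 2 experts per token
        weights = F.softmax(weights, dim=-1)            # mixing weights for the 2 experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                   # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

# Example: route 4 tokens through the layer; each token only touches 2 of the 8 experts.
if __name__ == "__main__":
    moe = Top2MoE()
    tokens = torch.randn(4, 4096)
    print(moe(tokens).shape)  # torch.Size([4, 4096])
```

Because attention weights are shared across experts and only two expert FFNs fire per token, the per-token compute is close to that of a dense ~13B model even though the full ~47B parameters must be kept in memory (hence the option to offload to CPU when VRAM is insufficient).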