Mixture-of-experts (MoE) layers are used to improve the performance of transformer models, particularly large language models (LLMs). An MoE layer replaces the dense feed-forward layer with a set of sparse expert networks and a router that assigns each token to a subset of those experts. The routing mechanism typically uses a learned gating network: a linear projection of the token representation produces per-expert scores, a softmax converts them to probabilities, and each token is dispatched to its top-k experts, whose outputs are combined using the gating weights.
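The gating-and-dispatch step described above can be sketched as follows. This is a minimal, unoptimized illustration in NumPy, not a production implementation; the function names and shapes (`moe_layer`, `w_router`, a list of per-expert callables) are assumptions for the sake of the example, and real systems batch the expert computation and add load-balancing losses.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax over the expert dimension.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(x, w_router, experts, k=2):
    """Route each token to its top-k experts and combine their outputs.

    x:        (tokens, d_model) token representations
    w_router: (d_model, n_experts) gating projection
    experts:  list of callables, each mapping (d_model,) -> (d_model,)
    """
    logits = x @ w_router            # (tokens, n_experts) expert scores
    probs = softmax(logits)          # gating probabilities per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        topk = np.argsort(probs[t])[-k:]      # indices of the k best experts
        weights = probs[t, topk]
        weights = weights / weights.sum()     # renormalize over chosen experts
        for e, w in zip(topk, weights):       # weighted sum of expert outputs
            out[t] += w * experts[e](x[t])
    return out
```

Because only k of the experts run per token, the layer's parameter count can grow with the number of experts while the per-token compute stays roughly constant.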