Accelerated expert-parallel distributed tuning in Red Hat OpenShift AI

A hands-on guide to fine-tuning Mixture of Experts (MoE) models with fms-hf-tuning on Red Hat OpenShift AI. It covers two key optimizations: MoE scatter-mode kernels (fast_moe) for memory savings and higher throughput, and expert parallelism for distributing experts across GPUs. The guide walks through full fine-tuning and LoRA workflows for the IBM Granite 4.0 tiny and small hybrid MoE models using Kubeflow Trainer v2, including dataset preparation, training job configuration with FSDPv2, LoRA adapter merging, and serving the fine-tuned model with vLLM on OpenShift AI. The reported throughput is 590 tokens/GPU/second on 4x A100 GPUs.

#deep-learning

#mixture-of-experts

Mar 11•13m read time•From developers.redhat.com

Table of contents

Open source tuning with fms-hf-tuning
Mixture of experts kernels, and expert parallelism
MoE kernels
Expert parallelism
Tuning Granite 4 tiny and small MoE hybrid models
Prerequisites
Prepare the dataset
Running a training job
Running a training job: Full fine-tuning
Running a training job: LoRA
Serve the fine-tuned Granite 4.0 models
Try fms-hf-tuning
