A practical guide to parameter-efficient fine-tuning of NVIDIA Cosmos Predict 2.5, a 2B-parameter world model, using LoRA and DoRA adapters for robot manipulation video generation. It covers the full pipeline: preparing a dataset of 92 robot manipulation videos, injecting LoRA adapters into the DiT's attention and feedforward layers, the rectified flow training loss, optimizer and scheduler setup, and inference with fused adapter weights. Quantitative results show that 100 epochs of training (~2.5 hours on 8× H100s) substantially improve geometric consistency, physical plausibility, and instruction following. LoRA r=32 and DoRA r=32 perform similarly; DoRA may help at very low ranks but is not necessary. A larger rank improves instruction following but not geometric quality.
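As a rough illustration of what "inference with fused adapter weights" means, the sketch below uses the standard LoRA formulation, W = W0 + (alpha / r) * B A, and checks that folding the adapter into the base weight gives the same output as running the base and adapter paths separately. It is a minimal NumPy sketch, not the post's code: the function names, matrix sizes, and alpha value are all assumptions chosen for illustration.

```python
import numpy as np

def lora_forward(x, W0, A, B, alpha):
    """Unfused path: frozen base projection plus the scaled low-rank update."""
    r = A.shape[0]
    return x @ W0.T + (alpha / r) * (x @ A.T) @ B.T

def fuse_lora(W0, A, B, alpha):
    """Fold the adapter into the base weight so inference needs one matmul."""
    r = A.shape[0]
    return W0 + (alpha / r) * (B @ A)

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 16, 8, 4, 8.0  # illustrative sizes, not the post's
W0 = rng.standard_normal((d_out, d_in))   # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.01
# B is usually zero-initialized before training; it is nonzero here so the
# equivalence check below is not trivial.
B = rng.standard_normal((d_out, rank)) * 0.01
x = rng.standard_normal((3, d_in))

y_unfused = lora_forward(x, W0, A, B, alpha)
W_fused = fuse_lora(W0, A, B, alpha)
y_fused = x @ W_fused.T

# Trainable adapter parameters versus a full update of this layer.
lora_params = rank * (d_in + d_out)
full_params = d_in * d_out
```

Fusing changes nothing numerically, but it removes the extra adapter matmuls at inference time. DoRA additionally learns a per-column magnitude on top of the low-rank direction update, which this sketch does not cover.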

12m read time · From huggingface.co
Table of contents
Motivation
Requirements
Preparing Data
Training
  VideoDataset
  Initialize Adapter
  Loss
  Optimizer and Scheduler
  Checkpointing
  Training Command
Running Inference with Your LoRA
  ImageDataset
  Loading the Pipeline and LoRA/DoRA Weights
  Generating initial latent noise
  Inference Command
Evaluation Metrics
  Sampson Error
  LLM-as-a-Judge
Results
  Qualitative Analysis
  Quantitative Analysis
