Fine-Tuning NVIDIA Cosmos Predict 2.5 with LoRA/DoRA for Robot Video Generation

A practical guide to parameter-efficient fine-tuning of NVIDIA Cosmos Predict 2.5, a 2B-parameter world model, using LoRA and DoRA adapters for robot manipulation video generation. It covers the full pipeline: preparing a dataset of 92 robot manipulation videos, injecting LoRA adapters into the DiT's attention and feedforward layers, training with a rectified flow loss, setting up the optimizer and scheduler, and running inference with fused adapter weights. Quantitative results show that 100 epochs of fine-tuning (~2.5 hours on 8× H100s) substantially improve geometric consistency, physical plausibility, and instruction following. LoRA r=32 and DoRA r=32 perform similarly; DoRA may help at very low ranks but is not necessary. A larger rank improves instruction following but not geometric quality.
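To make "parameter-efficient" concrete, the sketch below counts the trainable parameters that a LoRA or DoRA adapter adds to a single linear layer. It is an illustration, not the guide's code: the function names are hypothetical, the 2048 hidden size is a placeholder rather than the actual Cosmos Predict 2.5 dimension, and the DoRA count assumes one learned magnitude per output feature, as in common implementations.

```python
def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) linear layer.

    LoRA keeps the base weight W frozen and learns a low-rank update B @ A,
    with A of shape (rank, d_in) and B of shape (d_out, rank).
    """
    return rank * (d_in + d_out)


def dora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters DoRA adds to the same layer.

    Assumption: DoRA trains the same A and B as LoRA plus a magnitude
    vector with one entry per output feature.
    """
    return lora_extra_params(d_in, d_out, rank) + d_out


if __name__ == "__main__":
    hidden = 2048  # hypothetical layer width, not taken from the guide
    rank = 32      # the rank compared in the summary above

    full = hidden * hidden
    lora = lora_extra_params(hidden, hidden, rank)
    dora = dora_extra_params(hidden, hidden, rank)

    print(f"full weight: {full} parameters")
    print(f"LoRA r={rank}: {lora} parameters ({lora / full:.2%} of full)")
    print(f"DoRA r={rank}: {dora} parameters ({dora / full:.2%} of full)")
```

Under these assumptions the two adapter types differ by only a few thousand parameters per layer at r=32, which fits the summary's observation that they perform similarly at that rank.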

#robotics

#video-generation

#lora

May 18 • 12m read time • From huggingface.co

Table of contents

Motivation
Requirements
Preparing Data
Training
    VideoDataset
    Initialize Adapter
    Loss
    Optimizer and Scheduler
    Checkpointing
    Training Command
Running Inference with Your LoRA
    ImageDataset
    Loading the Pipeline and LoRA/DoRA Weights
    Generating initial latent noise
    Inference Command
Evaluation Metrics
    Sampson Error
    LLM-as-a-Judge
Results
    Qualitative Analysis
    Quantitative Analysis
