This tutorial shows how to train AI agents to operate command-line interfaces safely using synthetic data generation and reinforcement learning with verifiable rewards (RLVR). It demonstrates fine-tuning NVIDIA Nemotron-Nano-9B-V2 to control the LangGraph CLI through a human-in-the-loop architecture. The approach combines NeMo Data Designer for generating validated training examples, NeMo Gym for building RL environments, and Unsloth for efficient GRPO-based training on a single GPU. This method enables rapid specialization of language models to new CLI tools without waiting for real-world usage data, while maintaining safety through multi-layered verification and mandatory human approval before command execution.
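To make the safety layers concrete, here is a minimal, hypothetical sketch of the human-in-the-loop pattern described above: the agent proposes a command, an automated verifier screens it, and a human approver must confirm before anything executes. The function names, the allowlist, and the `approve` callable are illustrative assumptions, not the tutorial's actual API.

```python
import shlex
import subprocess

# Illustrative allowlist: only the CLI the agent was specialized for.
ALLOWED_BINARIES = {"langgraph"}


def verify(command: str) -> bool:
    """First verification layer: the command must parse cleanly and
    invoke an allowed binary."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False
    return bool(tokens) and tokens[0] in ALLOWED_BINARIES


def execute_with_approval(command: str, approve) -> str:
    """Run `command` only if it passes verification AND a human approves.

    `approve` is any callable (e.g. a terminal prompt) that returns
    True or False; passing it in keeps the gate testable.
    """
    if not verify(command):
        return "rejected: failed verification"
    if not approve(command):
        return "rejected: human denied"
    result = subprocess.run(shlex.split(command), capture_output=True, text=True)
    return result.stdout


# A command outside the allowlist never reaches the human.
print(execute_with_approval("rm -rf /", lambda c: True))
# An allowed command still needs explicit human approval to run.
print(execute_with_approval("langgraph dev", lambda c: False))
```

The key design choice is that verification and approval are independent gates: an automated check cannot be bypassed by an eager approver, and a passing check still requires a human sign-off.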
Table of contents
- What you'll build: a specialized agent to run a new CLI tool
- Why use synthetic data generation and reinforcement learning to teach a new CLI?
- Prerequisites
- Step 1: Design a synthetic dataset with NeMo Data Designer
- Step 2: Fine-tune with RLVR (using GRPO)
- Step 3: Human-in-the-loop execution
- Why RLVR + synthetic data work for customizing Agentic AI
- Closing thoughts