According to the post, LLM fine-tuning in 2026 no longer has to depend on manually curated datasets or hand-crafted reward functions. It explains Reinforcement Fine-Tuning (RFT) using GRPO (Group Relative Policy Optimization), the algorithm behind DeepSeek-R1, which trains a model by sampling several completions per prompt and reinforcing those that score above the group average. ART (Agent Reinforcement Trainer) is introduced as an open-source framework that applies GRPO to real-world multi-step agents with tool calls and integrates with LangGraph, CrewAI, and ADK. RULER (Relative Universal LLM-Elicited Rewards) removes the need for manual reward functions by using an LLM-as-judge to rank agent trajectories against each other. A practical notebook example trains a 3B model on MCP server tasks using automatic RULER evaluation.
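To make the "above-average" idea concrete, here is a minimal sketch of the group-relative advantage at the core of GRPO: each completion's reward is compared with the mean of its own group and scaled by the group's standard deviation. The function name, the `eps` constant, and the example reward values are illustrative assumptions, not code from the post or from ART; in the setup the post describes, the rewards would come from a judge such as RULER.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Return one advantage per completion in a single sampled group.

    Hypothetical helper for illustration: completions scoring above the
    group mean get a positive advantage (reinforced), those below get a
    negative one (discouraged).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps avoids division by zero when every completion gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four completions of the same prompt.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
```

Because the baseline is the group mean, no separate value model is needed, and a group in which every completion receives the same reward contributes no learning signal, since all advantages are zero.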
Table of contents
How to fine-tune LLMs in 2026
12 must-use features in Claude Code
P.S. For those wanting to develop “Industry ML” expertise