This talk will be a technical deep dive into RL for agentic reasoning via multi-turn tool calling, similar to OpenAI's o3 and Deep Research. In particular, we'll cover:

- When, why, and how to use RL
- GRPO vs. PPO vs. other algorithms
- Designing environments and rewards
- Survey of recent research highlights
- Results on example tasks
- Overview of open-source ecosystem (libraries, compute requirements, tradeoffs, etc.)

About Will Brown
Will Brown is a Research Engineering Lead at Prime Intellect, focusing on RL for reasoning and agents. He previously held research roles at Morgan Stanley and AWS, and completed his PhD in Computer Science at Columbia University.

Recorded at the AI Engineer World's Fair in San Francisco.

Timestamps
[00:00] Introduction to the idea that reasoning and agents are similar.
[01:05] The growing effectiveness of Reinforcement Learning (RL) in AI.
[03:04] The complexities and challenges of implementing RL.
[04:41] The connection between popular AI products (agents) and RL fine-tuning.
[07:18] The core process of Reinforcement Learning.
[10:21] The importance of tools and real-world tasks for agents.
[12:13] The problem of "reward hacking" and how to design better evaluations.
[14:51] Future directions for agentic systems and a practical toolkit for implementation.

Reinforcement learning (RL) has become a viable approach for training agentic AI systems, with companies like DeepSeek demonstrating its effectiveness at scale. The key insight is that building agents and training reasoning models are fundamentally the same problem: both involve iterative interaction loops with environments and evaluation systems. While implementing RL can be complex, new tools and frameworks are making it more accessible to startups and individual researchers. Powerful agents will likely require moving beyond simple API wrappers to custom-trained models that can handle multi-turn interactions, tool use, and complex reward structures. Success depends on designing evaluations and reward functions that capture the intended behavior without being easily gamed.
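
To make the "interaction loop plus evaluation" idea concrete, here is a minimal Python sketch of a multi-turn tool-calling rollout scored by a reward function. It is not taken from the talk or from any particular library: the `CALL`/`ANSWER` text protocol, the `add` tool, `scripted_policy` (a stand-in for a trained model), and every other name are assumptions made up for illustration. The reward checks the final answer against a known result, which is one simple way to avoid rewarding outputs that merely look right.

```python
from dataclasses import dataclass, field

# Hypothetical tool for illustration: adds two comma-separated integers.
def add_tool(args: str) -> str:
    try:
        a, b = (int(x) for x in args.split(","))
    except ValueError:
        return "error: expected two integers, e.g. 2,3"
    return str(a + b)

TOOLS = {"add": add_tool}

@dataclass
class Episode:
    question: str
    expected: str
    transcript: list = field(default_factory=list)  # (role, text) pairs

def scripted_policy(transcript):
    """Stand-in for a model: call the tool once, then answer with its output."""
    tool_outputs = [text for role, text in transcript if role == "tool"]
    if not tool_outputs:
        return "CALL add 2,3"
    return f"ANSWER {tool_outputs[-1]}"

def rollout(policy, episode, max_turns=4):
    """Run the policy/environment loop until an answer or the turn limit."""
    episode.transcript.append(("user", episode.question))
    for _ in range(max_turns):
        action = policy(episode.transcript)
        episode.transcript.append(("assistant", action))
        if action.startswith("ANSWER "):
            return action[len("ANSWER "):].strip()
        parts = action.split(" ", 2)
        if len(parts) == 3 and parts[0] == "CALL":
            tool = TOOLS.get(parts[1])
            result = tool(parts[2]) if tool else f"error: unknown tool {parts[1]}"
            episode.transcript.append(("tool", result))
    return None  # the policy never produced a final answer

def reward(final_answer, expected):
    """Exact-match reward: confident-sounding but wrong answers score 0."""
    return 1.0 if final_answer == expected else 0.0

if __name__ == "__main__":
    ep = Episode(question="What is 2 + 3?", expected="5")
    answer = rollout(scripted_policy, ep)
    print(answer, reward(answer, ep.expected))
```

In an actual RL setup, the scripted policy would be replaced by sampling from the model being trained, and the per-episode rewards would feed an algorithm such as GRPO or PPO. The sketch only covers the rollout and scoring steps.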