Best of Reinforcement Learning 2025

  1. Article · Sebastian Raschka · 15w

    The State Of LLMs 2025: Progress, Problems, and Predictions

    A comprehensive 2025 review of large language model developments highlights reinforcement learning with verifiable rewards (RLVR) and the GRPO algorithm as the year's dominant training paradigm, following DeepSeek R1's breakthrough. Key trends include inference-time scaling, tool use integration, and architectural efficiency tweaks like mixture-of-experts and linear attention mechanisms. The analysis addresses benchmarking challenges ("benchmaxxing"), discusses practical LLM usage for coding and writing, and examines the shift toward domain-specific models with proprietary data. Predictions for 2026 emphasize RLVR expansion beyond math/code, increased inference optimization, and the emergence of diffusion models for low-latency tasks.

  2. Article · Daily Dose of Data Science | Avi Chawla | Substack · 38w

    4 Stages of Training LLMs from Scratch

    Training large language models from scratch involves four key stages: pre-training on massive text corpora to learn language basics, instruction fine-tuning to make models conversational and follow commands, preference fine-tuning using human feedback (RLHF) to align with human preferences, and reasoning fine-tuning for mathematical and logical tasks using correctness as a reward signal. Each stage builds upon the previous one to create increasingly capable and aligned AI systems.

  3. Video · cozmouz · 48w

    I Trapped this AI Centipede in a Simulation for 1000 Years

    The video covers the creation and training of an AI centipede that learns realistic locomotion using proximal policy optimization and neural networks. The AI learns a metachronal gait, mimicking real centipedes, and adapts to external challenges that improve its movement. Interactive lessons from Brilliant are highlighted as learning tools for programming and AI concepts.

  4. Article · Machine Learning Mastery · 19w

    The Roadmap for Mastering Agentic AI in 2026

    A comprehensive learning path for building autonomous AI systems that can plan, reason, and act independently. Covers foundational mathematics and programming, machine learning fundamentals, autonomous agent architectures, specialization areas like robotics and workflow automation, deployment strategies using Docker and cloud platforms, and portfolio development. Includes curated resources from beginner prerequisites through advanced topics like multi-agent systems, transformer-based decision-making, and reinforcement learning with human feedback.

  5. Article · Medium · 1y

    Mathematical Foundation Underpinning Reinforcement Learning

    Reinforcement learning (RL) is inspired by the process of learning from experience, and the Soft Actor-Critic (SAC) algorithm is a popular framework for it. This post covers the mathematical foundation of SAC agents, detailing the actor (policy) and critic networks: the actor network estimates actions and their probabilities, while the critic network estimates the expected return of state-action pairs. PyTorch code snippets demonstrate how to implement these networks and integrate them into an RL model.
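
    As a hedged sketch of what such actor and critic networks might look like in PyTorch (layer sizes, state/action dimensions, and the tanh-squashed Gaussian policy are illustrative assumptions, not taken from the post):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: maps states to a Gaussian over actions (illustrative sizes)."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, action_dim)     # mean of the Gaussian policy
        self.log_std = nn.Linear(hidden, action_dim)  # log std of the Gaussian policy

    def forward(self, state):
        h = self.net(state)
        mean, log_std = self.mean(h), self.log_std(h).clamp(-20, 2)
        dist = torch.distributions.Normal(mean, log_std.exp())
        raw = dist.rsample()          # reparameterized sample (keeps gradients)
        action = torch.tanh(raw)      # squash action into [-1, 1]
        # log-probability with the tanh change-of-variables correction
        log_prob = (dist.log_prob(raw) - torch.log(1 - action.pow(2) + 1e-6)).sum(-1)
        return action, log_prob

class Critic(nn.Module):
    """Q(s, a): estimates the expected return of a state-action pair."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.q = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.q(torch.cat([state, action], dim=-1)).squeeze(-1)
```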

  6. Article · Sebastian Raschka · 18w

    From Random Forests to RLVR: A Short History of ML/AI Hello Worlds

    A chronological overview traces the evolution of beginner-friendly ML/AI examples from 2013 to 2025. Starting with Random Forests on Iris datasets and XGBoost on Kaggle competitions, it progresses through neural networks (MLPs, AlexNet), transformer models (DistilBERT, Llama 2 with LoRA), and culminates with reasoning models using RLVR on mathematical datasets. Each milestone reflects when methods became mainstream and accessible, often lagging years behind their initial publication due to tooling maturity and community adoption.

  7. Video · Computerphile · 1y

    Solve Markov Decision Processes with the Value Iteration Algorithm - Computerphile

    The value iteration algorithm is a method for solving Markov decision processes (MDPs) to produce optimal action decisions. MDPs model decision-making problems, particularly those under uncertainty. The algorithm iteratively computes the values of states to find the policy that minimizes cost or maximizes reward. It is essential for decision-making models where dynamic programming techniques are applied to achieve the best outcome.
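
    The iterative computation of state values can be sketched on a tiny made-up MDP (the transition table below is invented for illustration; the algorithm itself is standard value iteration with a Bellman-optimality backup):

```python
# P[s][a] = list of (probability, next_state, reward) transitions.
# Toy three-state MDP: from state 0, "go" probably reaches state 1;
# from state 1, "go" reaches state 2 and collects reward 1.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 0.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(1.0, 2, 1.0)]},
    2: {"stay": [(1.0, 2, 0.0)]},
}

def value_iteration(P, gamma=0.9, tol=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup: best expected one-step return
            q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in trans)
                 for a, trans in P[s].items()}
            best = max(q.values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # extract the greedy policy w.r.t. the converged values
    pi = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                         for p, s2, r in P[s][a]))
          for s in P}
    return V, pi
```

On this toy problem the greedy policy chooses "go" in states 0 and 1, which is the reward-maximizing behavior the algorithm is designed to find.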

  8. Article · Sebastian Raschka · 38w

    LLM Research Papers: The 2025 List (January to June)

    A curated collection of over 200 LLM research papers from the first half of 2025, organized by topic rather than chronologically. The list focuses heavily on reasoning models, covering training strategies, inference-time scaling methods, and evaluation approaches. Major categories include reinforcement learning methods, efficient training architectures, multimodal models, and diffusion-based language models. The author plans to provide detailed analysis of key papers in future articles.

  9. Article · Daily Dose of Data Science | Avi Chawla | Substack · 1y

    Guardrails for AI Agents

    The post explains how reinforcement fine-tuning (RFT) enhances open-source LLMs, offering accuracy gains and efficient fine-tuning with few examples. It also details implementing guardrails for AI agents to prevent issues like hallucination and infinite loops. The guide walks through setting up validation checkpoints, limiting tool usage, and specifying fallback mechanisms with practical code examples.
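
    A minimal sketch of those guardrails in plain Python — a tool-call cap, a validation checkpoint, and a fallback (function names and thresholds here are illustrative assumptions; the post uses its own framework and APIs):

```python
MAX_TOOL_CALLS = 5  # hard cap to prevent infinite tool-use loops (assumed value)

def validate(answer: str) -> bool:
    """Validation checkpoint: reject empty or suspiciously short answers."""
    return len(answer.strip()) >= 10

def run_agent(step, fallback="Sorry, I couldn't answer reliably."):
    """`step` is a callable performing one agent iteration; it returns
    ('tool', ...) to keep going or ('answer', text) to stop."""
    for _ in range(MAX_TOOL_CALLS):
        kind, payload = step()
        if kind == "answer":
            # only release validated answers; otherwise use the fallback
            return payload if validate(payload) else fallback
    return fallback  # loop limit hit: fall back instead of spinning forever
```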

  10. Video · Dave's Garage · 44w

    Vibe Coding 101: Writing an AI with AI

    A developer demonstrates live coding of an AI system that learns to play the classic arcade game Tempest through reinforcement learning. The project involves running multiple game instances on a 96-core machine, extracting game state data from memory, and iteratively refining reward functions to improve AI performance. The AI currently reaches level 33 but struggles with faster yellow levels, prompting adjustments to encourage more cautious gameplay behavior.

  11. Article · Hacker News · 1y

    Jiayi-Pan/TinyZero

    TinyZero is a reproduction of DeepSeek R1 Zero built with the veRL framework. Using reinforcement learning, it demonstrates how self-verification and search abilities emerge in a 3B base language model, and the experiment can be reproduced for less than $30.

  12. Article · Daily Dose of Data Science | Avi Chawla | Substack · 32w

    Build a Reasoning LLM using GRPO

    Group Relative Policy Optimization (GRPO) is a reinforcement learning method that fine-tunes large language models for math and reasoning tasks using deterministic reward functions, eliminating the need for labeled data. The process involves generating multiple candidate responses, assigning rewards based on deterministic functions, and using GRPO loss to update the model through backpropagation. A practical implementation demonstrates using UnslothAI and HuggingFace TRL to transform a base model into a reasoning-capable system, with reward functions that validate response format and correctness without manual labeling.
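
    The group-relative normalization that gives GRPO its name can be sketched in a few lines: sample a group of responses per prompt, score each with a deterministic reward, and normalize rewards within the group (the rewards below are toy values; the full GRPO loss additionally involves per-token log-probability ratios and typically a KL penalty):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each response relative to its own sampling group:
    (reward - group mean) / group std. No learned value network needed."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Responses scoring above the group average get positive advantages (their token probabilities are pushed up), and below-average responses get negative ones, which is how correctness alone can steer the model without labeled preference data.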

  13. Article · Daily Dose of Data Science | Avi Chawla | Substack · 1y

    [Hands-on] Build Your Reasoning LLM

    Reinforcement fine-tuning (RFT) can transform open-source LLMs into advanced reasoning models without labeled data. The post walks through using Predibase for RFT to enhance Qwen-2.5:7b, contrasts RFT with supervised fine-tuning (SFT), outlines the setup and training steps on the Countdown dataset, and explains the reward functions used to evaluate the model.
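
    As an illustration of the kind of deterministic reward functions such a setup uses for Countdown (combine given numbers with arithmetic to hit a target), here is a hedged sketch — the tag names, partial-credit values, and parsing rules are assumptions, not the post's actual Predibase reward functions:

```python
import re

def format_reward(completion: str) -> float:
    """Partial credit for emitting the expected <answer>...</answer> tag."""
    return 0.1 if re.search(r"<answer>.*?</answer>", completion, re.S) else 0.0

def correctness_reward(completion: str, numbers: list, target: int) -> float:
    """Full credit only if the proposed expression uses exactly the given
    numbers and evaluates to the target."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.S)
    if not m:
        return 0.0
    expr = m.group(1).strip()
    if not re.fullmatch(r"[\d+\-*/() .]+", expr):  # restrict to arithmetic before eval
        return 0.0
    used = sorted(int(n) for n in re.findall(r"\d+", expr))
    if used != sorted(numbers):                    # must use exactly the given numbers
        return 0.0
    try:
        # eval is acceptable here only because the regex above restricts input;
        # a production reward function would use a proper expression parser
        return 1.0 if abs(eval(expr) - target) < 1e-6 else 0.0
    except ZeroDivisionError:
        return 0.0
```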

  14. Article · DiamantAI · 1y

    Reinforcement Learning Explained

    Reinforcement learning involves teaching an AI to adapt and learn by interacting with its environment. Key topics include agents & environment, policy, Q-learning, the exploration-exploitation dilemma, function approximation & memory, hierarchical methods, meta-learning, and multi-agent setups.
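
    Q-learning and the exploration-exploitation dilemma can be demonstrated with tabular Python on a toy chain environment (the environment and hyperparameters below are invented for illustration): the agent must occasionally act randomly to discover the reward at the right end of the chain.

```python
import random

N_STATES, ACTIONS = 5, [0, 1]  # toy 5-state chain; action 0 = left, 1 = right
def step(s, a):
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = s2 == N_STATES - 1          # reaching the rightmost state ends the episode
    return s2, (1.0 if done else 0.0), done

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy: explore with probability eps, else exploit
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda x: Q[s][x])
            s2, r, done = step(s, a)
            # temporal-difference update toward r + gamma * max_a' Q(s', a')
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q
```

After training, the greedy policy moves right from every state, and the learned values decay geometrically (by gamma) with distance from the goal.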