Best of Reinforcement Learning 2025

  1. Article · Sebastian Raschka · 15w

    The State Of LLMs 2025: Progress, Problems, and Predictions

    A comprehensive 2025 review of large language model developments highlights reinforcement learning with verifiable rewards (RLVR) and the GRPO algorithm as the year's dominant training paradigm, following DeepSeek R1's breakthrough. Key trends include inference-time scaling, tool use integration, and architectural efficiency tweaks like mixture-of-experts and linear attention mechanisms. The analysis addresses benchmarking challenges ("benchmaxxing"), discusses practical LLM usage for coding and writing, and examines the shift toward domain-specific models with proprietary data. Predictions for 2026 emphasize RLVR expansion beyond math/code, increased inference optimization, and the emergence of diffusion models for low-latency tasks.

  2. Article · Daily Dose of Data Science | Avi Chawla | Substack · 38w

    4 Stages of Training LLMs from Scratch

    Training large language models from scratch involves four key stages: pre-training on massive text corpora to learn language basics, instruction fine-tuning to make models conversational and follow commands, preference fine-tuning using human feedback (RLHF) to align with human preferences, and reasoning fine-tuning for mathematical and logical tasks using correctness as a reward signal. Each stage builds upon the previous one to create increasingly capable and aligned AI systems.

  3. Video · cozmouz · 48w

    I Trapped this AI Centipede in a Simulation for 1000 Years

    The video covers the creation and training of an AI centipede that learns realistic locomotion using proximal policy optimization and neural networks. The AI learns a metachronal gait, mimicking real centipedes, and adapts to external challenges that improve its movement. Interactive lessons from Brilliant are highlighted as learning tools for programming and AI concepts.

  4. Article · Machine Learning Mastery · 19w

    The Roadmap for Mastering Agentic AI in 2026

    A comprehensive learning path for building autonomous AI systems that can plan, reason, and act independently. Covers foundational mathematics and programming, machine learning fundamentals, autonomous agent architectures, specialization areas like robotics and workflow automation, deployment strategies using Docker and cloud platforms, and portfolio development. Includes curated resources from beginner prerequisites through advanced topics like multi-agent systems, transformer-based decision-making, and reinforcement learning with human feedback.

  5. Article · Medium · 1y

    Mathematical Foundation Underpinning Reinforcement Learning

    Reinforcement learning (RL) is inspired by the process of learning from experience, and the Soft Actor-Critic (SAC) algorithm is a popular framework for it. This post covers the mathematical foundation of SAC agents, detailing the actor (policy) and critic networks: the actor network estimates actions and their probabilities, while the critic network estimates the expected return of state-action pairs. PyTorch code snippets demonstrate how to implement these networks and integrate them into an RL model.
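
    As a hedged sketch of what such actor and critic networks might look like in PyTorch (layer sizes, state/action dimensions, and the tanh-squashed Gaussian policy are illustrative assumptions, not taken from the post):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: maps states to a Gaussian over actions (illustrative sizes)."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, action_dim)     # mean of the Gaussian policy
        self.log_std = nn.Linear(hidden, action_dim)  # log std of the Gaussian policy

    def forward(self, state):
        h = self.net(state)
        mean, log_std = self.mean(h), self.log_std(h).clamp(-20, 2)
        dist = torch.distributions.Normal(mean, log_std.exp())
        raw = dist.rsample()          # reparameterized sample (keeps gradients)
        action = torch.tanh(raw)      # squash action into [-1, 1]
        # log-probability with the tanh change-of-variables correction
        log_prob = (dist.log_prob(raw) - torch.log(1 - action.pow(2) + 1e-6)).sum(-1)
        return action, log_prob

class Critic(nn.Module):
    """Q(s, a): estimates the expected return of a state-action pair."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.q = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.q(torch.cat([state, action], dim=-1)).squeeze(-1)
```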

  6. Article · Sebastian Raschka · 18w

    From Random Forests to RLVR: A Short History of ML/AI Hello Worlds

    A chronological overview traces the evolution of beginner-friendly ML/AI examples from 2013 to 2025. Starting with Random Forests on Iris datasets and XGBoost on Kaggle competitions, it progresses through neural networks (MLPs, AlexNet), transformer models (DistilBERT, Llama 2 with LoRA), and culminates with reasoning models using RLVR on mathematical datasets. Each milestone reflects when methods became mainstream and accessible, often lagging years behind their initial publication due to tooling maturity and community adoption.

  7. Video · Computerphile · 1y

    Solve Markov Decision Processes with the Value Iteration Algorithm - Computerphile

    The value iteration algorithm is a method for solving Markov decision processes (MDPs) to produce optimal action decisions. MDPs model decision-making problems, particularly those under uncertainty. The algorithm iteratively computes the values of states to find the policy that minimizes cost or maximizes reward. It is essential for decision-making models where dynamic programming techniques are applied to achieve the best outcome.
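
    The iterative computation of state values can be sketched on a tiny made-up MDP (the transition table below is invented for illustration; the algorithm itself is standard value iteration with a Bellman-optimality backup):

```python
# P[s][a] = list of (probability, next_state, reward) transitions.
# Toy three-state MDP: from state 0, "go" probably reaches state 1;
# from state 1, "go" reaches state 2 and collects reward 1.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 0.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(1.0, 2, 1.0)]},
    2: {"stay": [(1.0, 2, 0.0)]},
}

def value_iteration(P, gamma=0.9, tol=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup: best expected one-step return
            q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in trans)
                 for a, trans in P[s].items()}
            best = max(q.values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # extract the greedy policy w.r.t. the converged values
    pi = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                         for p, s2, r in P[s][a]))
          for s in P}
    return V, pi
```

On this toy problem the greedy policy chooses "go" in states 0 and 1, which is the reward-maximizing behavior the algorithm is designed to find.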

  8. Article · Sebastian Raschka · 38w

    LLM Research Papers: The 2025 List (January to June)

    A curated collection of over 200 LLM research papers from the first half of 2025, organized by topic rather than chronologically. The list focuses heavily on reasoning models, covering training strategies, inference-time scaling methods, and evaluation approaches. Major categories include reinforcement learning methods, efficient training architectures, multimodal models, and diffusion-based language models. The author plans to provide detailed analysis of key papers in future articles.

  9. Article · Daily Dose of Data Science | Avi Chawla | Substack · 1y

    Guardrails for AI Agents

    The post explains how reinforcement fine-tuning (RFT) enhances open-source LLMs, offering accuracy gains and efficient fine-tuning with few examples. It also details implementing guardrails for AI agents to prevent issues like hallucination and infinite loops. The guide walks through setting up validation checkpoints, limiting tool usage, and specifying fallback mechanisms with practical code examples.
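
    A minimal sketch of those guardrails in plain Python — a tool-call cap, a validation checkpoint, and a fallback (function names and thresholds here are illustrative assumptions; the post uses its own framework and APIs):

```python
MAX_TOOL_CALLS = 5  # hard cap to prevent infinite tool-use loops (assumed value)

def validate(answer: str) -> bool:
    """Validation checkpoint: reject empty or suspiciously short answers."""
    return len(answer.strip()) >= 10

def run_agent(step, fallback="Sorry, I couldn't answer reliably."):
    """`step` is a callable performing one agent iteration; it returns
    ('tool', ...) to keep going or ('answer', text) to stop."""
    for _ in range(MAX_TOOL_CALLS):
        kind, payload = step()
        if kind == "answer":
            # only release validated answers; otherwise use the fallback
            return payload if validate(payload) else fallback
    return fallback  # loop limit hit: fall back instead of spinning forever
```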

  10. Video · Dave's Garage · 44w

    Vibe Coding 101: Writing an AI with AI

    A developer demonstrates live coding of an AI system that learns to play the classic arcade game Tempest through reinforcement learning. The project involves running multiple game instances on a 96-core machine, extracting game state data from memory, and iteratively refining reward functions to improve AI performance. The AI currently reaches level 33 but struggles with faster yellow levels, prompting adjustments to encourage more cautious gameplay behavior.

  11. Article · Hacker News · 1y

    Jiayi-Pan/TinyZero

    TinyZero is a reproduction of DeepSeek R1 Zero built with the veRL framework. Using reinforcement learning, it demonstrates how self-verification and search abilities emerge in a 3B base language model, and the experiment can be reproduced for less than $30.

  12. Article · Daily Dose of Data Science | Avi Chawla | Substack · 32w

    Build a Reasoning LLM using GRPO

    Group Relative Policy Optimization (GRPO) is a reinforcement learning method that fine-tunes large language models for math and reasoning tasks using deterministic reward functions, eliminating the need for labeled data. The process involves generating multiple candidate responses, assigning rewards based on deterministic functions, and using GRPO loss to update the model through backpropagation. A practical implementation demonstrates using UnslothAI and HuggingFace TRL to transform a base model into a reasoning-capable system, with reward functions that validate response format and correctness without manual labeling.
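
    The group-relative normalization that gives GRPO its name can be sketched in a few lines: sample a group of responses per prompt, score each with a deterministic reward, and normalize rewards within the group (the rewards below are toy values; the full GRPO loss additionally involves per-token log-probability ratios and typically a KL penalty):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each response relative to its own sampling group:
    (reward - group mean) / group std. No learned value network needed."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Responses scoring above the group average get positive advantages (their token probabilities are pushed up), and below-average responses get negative ones, which is how correctness alone can steer the model without labeled preference data.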

  13. Article · Daily Dose of Data Science | Avi Chawla | Substack · 1y

    [Hands-on] Build Your Reasoning LLM

    Reinforcement fine-tuning (RFT) can transform open-source LLMs into advanced reasoning models without labeled data. The post walks through using Predibase for RFT to enhance Qwen-2.5:7b, contrasts RFT with supervised fine-tuning (SFT), outlines the setup and training steps on the Countdown dataset, and explains the reward functions used to evaluate the model.
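
    As an illustration of the kind of deterministic reward functions such a setup uses for Countdown (combine given numbers with arithmetic to hit a target), here is a hedged sketch — the tag names, partial-credit values, and parsing rules are assumptions, not the post's actual Predibase reward functions:

```python
import re

def format_reward(completion: str) -> float:
    """Partial credit for emitting the expected <answer>...</answer> tag."""
    return 0.1 if re.search(r"<answer>.*?</answer>", completion, re.S) else 0.0

def correctness_reward(completion: str, numbers: list, target: int) -> float:
    """Full credit only if the proposed expression uses exactly the given
    numbers and evaluates to the target."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.S)
    if not m:
        return 0.0
    expr = m.group(1).strip()
    if not re.fullmatch(r"[\d+\-*/() .]+", expr):  # restrict to arithmetic before eval
        return 0.0
    used = sorted(int(n) for n in re.findall(r"\d+", expr))
    if used != sorted(numbers):                    # must use exactly the given numbers
        return 0.0
    try:
        # eval is acceptable here only because the regex above restricts input;
        # a production reward function would use a proper expression parser
        return 1.0 if abs(eval(expr) - target) < 1e-6 else 0.0
    except ZeroDivisionError:
        return 0.0
```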

  14. Article · DiamantAI · 1y

    Reinforcement Learning Explained

    Reinforcement learning involves teaching an AI to adapt and learn by interacting with its environment. Key topics include agents & environment, policy, Q-learning, the exploration-exploitation dilemma, function approximation & memory, hierarchical methods, meta-learning, and multi-agent setups.
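
    Q-learning and the exploration-exploitation dilemma can be demonstrated with tabular Python on a toy chain environment (the environment and hyperparameters below are invented for illustration): the agent must occasionally act randomly to discover the reward at the right end of the chain.

```python
import random

N_STATES, ACTIONS = 5, [0, 1]  # toy 5-state chain; action 0 = left, 1 = right
def step(s, a):
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = s2 == N_STATES - 1          # reaching the rightmost state ends the episode
    return s2, (1.0 if done else 0.0), done

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy: explore with probability eps, else exploit
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda x: Q[s][x])
            s2, r, done = step(s, a)
            # temporal-difference update toward r + gamma * max_a' Q(s', a')
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q
```

After training, the greedy policy moves right from every state, and the learned values decay geometrically (by gamma) with distance from the goal.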