Part 2 of a hands-on reinforcement learning course series covering the formal foundations every RL algorithm is built on. Topics include the Markov property, MDPs as a 5-tuple, episodic vs. continuing tasks, returns and discounting, the reward hypothesis and reward hacking, deterministic and stochastic policies, state-value functions, and a complete Monte Carlo policy evaluation implementation on a 4×4 gridworld. The series contextualizes RL's growing relevance through its use in LLM post-training pipelines (RLHF, GRPO, constitutional AI) and agentic AI systems.
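The Monte Carlo policy evaluation mentioned above can be sketched as follows. This is a minimal first-visit Monte Carlo estimator on a 4×4 gridworld under an equiprobable random policy; the specific environment details here (terminal state in one corner, reward of -1 per step, undiscounted returns) are a common textbook setup and are assumptions, not necessarily the exact configuration used in the post.

```python
import random
from collections import defaultdict

# Assumed 4x4 gridworld: states 0..15, state 0 is terminal,
# reward -1 per step, equiprobable random policy.
N = 4
TERMINAL = 0
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    """Move in the grid; bumping into a wall leaves the state unchanged."""
    r, c = divmod(state, N)
    dr, dc = action
    nr = max(0, min(N - 1, r + dr))
    nc = max(0, min(N - 1, c + dc))
    return nr * N + nc, -1.0

def generate_episode(rng, max_steps=200):
    """Roll out the random policy until the terminal state (or a step cap)."""
    state = rng.randrange(1, N * N)  # start anywhere except the terminal
    episode = []
    for _ in range(max_steps):
        action = rng.choice(ACTIONS)
        next_state, reward = step(state, action)
        episode.append((state, reward))
        state = next_state
        if state == TERMINAL:
            break
    return episode

def mc_policy_evaluation(num_episodes=5000, gamma=1.0, seed=0):
    """First-visit Monte Carlo: average the return G_t observed the
    first time each state is visited in an episode."""
    rng = random.Random(seed)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(num_episodes):
        episode = generate_episode(rng)
        G = 0.0
        # Walk the episode backwards, accumulating the discounted return.
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G = r + gamma * G
            # Only record G if this is the first visit to s in the episode.
            if s not in (x for x, _ in episode[:t]):
                returns_sum[s] += G
                returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```

With enough episodes, states near the terminal corner converge to less negative values than distant states, since fewer -1 rewards accumulate before termination.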

2 min read · From blog.dailydoseofds.com
