CS234 Lecture Notes: Foundations of RL and MDPs (Lectures 1 & 2)
I have been working through Stanford's CS234 course on Reinforcement Learning, taught by Professor Emma Brunskill, as part of building a solid theoretical foundation in RL. These are my notes from the first two lectures. Lecture 1 covers the framing of RL and builds up to Markov Reward Processes. Lecture 2 introduces Markov Decision Processes and the two core planning algorithms: policy iteration and value iteration. I have included active recall questions after each section, which I have found useful for making the material actually stick rather than just reading passively.
Lecture 1: What is Reinforcement Learning?
The Core Idea
Reinforcement learning is about learning through experience to make good decisions under uncertainty. Unlike supervised learning, there is no labeled dataset. The agent learns by interacting with an environment and receiving reward signals.
Four properties define most RL problems:
Optimization. The agent has an explicit objective: find the best possible policy, not just a good enough one.
Delayed consequences. Actions taken now affect rewards far in the future. This creates two sub-challenges: during planning, you have to reason about long-term ramifications, not just immediate effects. During learning, the credit assignment problem asks which past action actually caused a good or bad outcome.
Exploration. The agent learns about the world by acting in it. You only observe the reward for the action you took. You never see the counterfactual.
Generalization. A policy is a mapping from past experience to action. Effective RL requires generalizing from a limited number of observations to unseen states.

Active Recall Questions

**Q1.** What are the four defining properties of most RL problems? Give a one-sentence description of each.
Optimization (find the best policy, not just any policy), delayed consequences (actions now affect future rewards, causing credit assignment problems), exploration (you only observe the reward for the action taken, never the counterfactual), and generalization (the policy must extend from seen to unseen states).

**Q2.** What is the credit assignment problem and why is it hard?
When a reward is received, it is unclear which earlier action caused it. In long sequences of decisions, many actions have already been taken before any reward arrives, so attributing the reward to the right action is non-trivial.

**Q3.** How is RL fundamentally different from supervised learning?
In supervised learning, labeled input-output pairs are provided. In RL, the agent must discover which actions lead to good outcomes through interaction. It receives a reward signal rather than a correct label, and it only observes the reward for the action it actually chose.
Sequential Decision Making
At each discrete timestep $t$, the agent-environment loop proceeds as follows:
- Agent takes action $a_t$
- World transitions, emits observation $o_t$ and reward $r_t$
- Agent receives $o_t$ and $r_t$, updates its history
The full history up to time $t$ is:

$$h_t = (a_1, o_1, r_1, \ldots, a_t, o_t, r_t)$$
The agent selects its next action based on this history. In practice we compress the history into a state $s_t = f(h_t)$.

Active Recall Questions

**Q1.** Write out the history $h_t$. What does it contain?
$h_t = (a_1, o_1, r_1, \ldots, a_t, o_t, r_t)$. It is the full sequence of past actions, observations, and rewards up to and including timestep $t$.

**Q2.** Why do we compress history into a state rather than conditioning on the full history?
The full history grows without bound and becomes computationally intractable to condition on. Compressing it into a state makes the problem tractable, and under the Markov assumption no information is lost.
The Markov Assumption
A state $s_t$ is Markov if and only if:

$$p(s_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid h_t, a_t)$$

The future is independent of the past, given the present. Instead of conditioning on the entire history, you only need the current state. In practice, we often assume $s_t = o_t$ (the most recent observation is a sufficient statistic). The state representation has big implications for computational complexity, data requirements, and resulting performance.

Active Recall Questions

**Q1.** State the Markov condition in words and in math.
The future is independent of the past given the present. Formally: $p(s_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid h_t, a_t)$.

**Q2.** Why is the Markov assumption popular even when it is not strictly satisfied?
It is simple, and it can often be satisfied approximately by including a small window of recent history in the state. It dramatically reduces the complexity of the problem and is a useful working assumption in practice.

**Q3.** What are the practical consequences of choosing a poor state representation?
A poor state representation increases computational complexity, requires more data to learn from, and can hurt the performance of the resulting policy. If relevant history is excluded, the Markov property is violated and the agent may make systematically suboptimal decisions.
Markov Process (Markov Chain)
A Markov Process is the simplest building block: a memoryless random process over states with no rewards and no actions. It is a tuple $(\mathcal{S}, P)$ where:
- $\mathcal{S}$ is a finite set of states
- $P$ is a transition model: $P(s' \mid s) = p(s_{t+1} = s' \mid s_t = s)$
For $N$ states, $P$ is an $N \times N$ matrix. Given a Markov chain, you can sample episodes: sequences of states drawn according to the transition probabilities.

Active Recall Questions

**Q1.** What is a Markov Process? What does it have, and what does it lack compared to a full MDP?
A Markov Process (Markov chain) is a tuple $(\mathcal{S}, P)$: a set of states and a stochastic transition model. It has no actions and no rewards. It is the simplest building block on the way to an MDP.

**Q2.** If a Markov chain has $N$ states, what is the shape of the transition matrix $P$? What does entry $P_{ij}$ represent?
$P$ is $N \times N$. Entry $P_{ij}$ gives $p(s_{t+1} = j \mid s_t = i)$, the probability of transitioning from state $i$ to state $j$ in one step.
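To make episode sampling concrete, here is a minimal sketch. The 3-state chain and its transition probabilities below are invented for illustration; they are not from the lecture.

```python
import numpy as np

# Hypothetical 3-state Markov chain; row s holds P(s' | s) and sums to 1.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.6, 0.2],
    [0.0, 0.3, 0.7],
])

def sample_episode(P, start_state, length, rng):
    """Sample a state sequence by repeatedly drawing s' ~ P(. | s)."""
    states = [start_state]
    for _ in range(length - 1):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

rng = np.random.default_rng(0)
episode = sample_episode(P, start_state=0, length=10, rng=rng)
print(episode)
```

Each draw only looks at the current state's row of $P$, which is exactly the Markov property in action.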
Markov Reward Process (MRP)
A Markov Reward Process adds a reward signal to a Markov chain. It is a tuple $(\mathcal{S}, P, R, \gamma)$ where:
- $R(s) = \mathbb{E}[r_t \mid s_t = s]$ is the expected reward in state $s$
- $\gamma \in [0, 1]$ is the discount factor
There are still no actions.
The Return $G_t$. The foundational quantity in RL. It is the actual total discounted reward received from a specific trajectory starting at time $t$:

$$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$$

All value functions are simply the expected value of this return $G_t$.
State Value Function $V(s)$. The expected return from starting in state $s$:

$$V(s) = \mathbb{E}[G_t \mid s_t = s]$$

Discount factor intuition:
- $\gamma = 0$: agent only cares about immediate reward
- $\gamma$ close to $1$: future rewards are weighted almost as heavily as immediate rewards
- If the horizon is finite (finite episode), it is safe to use $\gamma = 1$
Active Recall Questions
**Q1.** Write the definition of the return $G_t$ as a summation. What does it represent?
$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$. It is the total discounted reward collected from timestep $t$ onwards along a specific trajectory.

**Q2.** What is the relationship between $G_t$ and any value function?
Every value function is simply $\mathbb{E}[G_t]$ conditioned on different information: $V(s) = \mathbb{E}[G_t \mid s_t = s]$, $Q(s, a) = \mathbb{E}[G_t \mid s_t = s, a_t = a]$, etc.

**Q3.** What does $\gamma = 0$ imply for agent behavior? What about $\gamma$ close to $1$?
$\gamma = 0$ means the agent is fully myopic and only maximizes the next immediate reward. $\gamma$ close to $1$ means future rewards are valued almost as much as present ones, so the agent takes a long-term view.

**Q4.** An MRP has no actions. So what is the point of studying it?
An MRP is the foundation for understanding value functions and the Bellman equation before introducing the complexity of actions. It is also exactly what you get when you fix a policy in an MDP (MDP + policy = MRP).
Two Views of the Value Function
The value function can be written in two mathematically equivalent ways.
Expectation form (global view). Defines value as the infinite sum of all future discounted rewards:

$$V(s) = \mathbb{E}[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \mid s_t = s]$$

Bellman form (local / recursive view). Decomposes value into an immediate reward plus the discounted expected value of the next state:

$$V(s) = R(s) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s) V(s')$$

These two forms are equivalent. The Bellman form is what makes computation tractable.

Active Recall Questions

**Q1.** What are the two equivalent forms of the value function? Name them and write both equations.
The expectation form (global): $V(s) = \mathbb{E}[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \mid s_t = s]$. The Bellman form (local/recursive): $V(s) = R(s) + \gamma \sum_{s'} P(s' \mid s) V(s')$.

**Q2.** Why is the Bellman form more useful for computation?
The Bellman form is recursive: it expresses $V(s)$ in terms of $V(s')$ for neighboring states. This allows iterative algorithms that update one state at a time, converging to the correct values without summing over infinite trajectories.

**Q3.** Derive the Bellman form from the expectation form.
Split $G_t = r_t + \gamma G_{t+1}$. Then $V(s) = \mathbb{E}[r_t + \gamma G_{t+1} \mid s_t = s] = R(s) + \gamma \mathbb{E}[V(s_{t+1}) \mid s_t = s] = R(s) + \gamma \sum_{s'} P(s' \mid s) V(s')$.
Computing the Value of an MRP
Analytic solution. In matrix form, $V = R + \gamma P V$, which rearranges to:

$$V = (I - \gamma P)^{-1} R$$

This requires a matrix inverse: $O(N^3)$ complexity. Only practical for small state spaces.
Iterative solution (dynamic programming). Initialize $V_0(s) = 0$ for all $s$. Then repeat until convergence:

$$V_{k+1}(s) = R(s) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s) V_k(s')$$

Each iteration costs $O(N^2)$, which is generally preferred.

Active Recall Questions

**Q1.** Derive the analytic solution from the Bellman equation.
Start from $V = R + \gamma P V$. Rearrange: $V - \gamma P V = R$, so $(I - \gamma P) V = R$, giving $V = (I - \gamma P)^{-1} R$.

**Q2.** What is the computational complexity of the analytic solution versus the iterative one? When would you prefer each?
Analytic: $O(N^3)$ due to matrix inversion. Iterative: $O(N^2)$ per iteration. Prefer the analytic solution for small, fixed state spaces. Prefer iterative for larger state spaces or when you only need an approximate solution.

**Q3.** What is the stopping criterion for the iterative algorithm?
Stop when $\max_s |V_{k+1}(s) - V_k(s)| < \epsilon$ for some small threshold $\epsilon$, meaning the value estimates have effectively stopped changing.
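The two solution methods are easy to compare side by side on a toy MRP. This is a sketch with made-up dynamics and rewards, not an example from the lecture:

```python
import numpy as np

# Toy 3-state MRP; P, R, and gamma are invented for illustration.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.1, 0.7, 0.2],
    [0.0, 0.4, 0.6],
])
R = np.array([1.0, 0.0, -1.0])
gamma = 0.9

# Analytic solution: V = (I - gamma P)^{-1} R, an O(N^3) linear solve.
V_exact = np.linalg.solve(np.eye(len(R)) - gamma * P, R)

# Iterative solution: sweep V <- R + gamma P V, O(N^2) per sweep.
V = np.zeros(len(R))
while True:
    V_new = R + gamma * P @ V
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

# Both methods land on the same fixed point.
assert np.allclose(V_new, V_exact, atol=1e-6)
```

Using `np.linalg.solve` rather than an explicit inverse is the standard numerical practice; the complexity is the same $O(N^3)$.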
Agent Components
An RL agent may include any combination of:
- Model: the agent's representation of how the world transitions ($P$) and what rewards look like ($R$)
- Policy $\pi$: a mapping from states to actions. Can be deterministic, $\pi(s) = a$, or stochastic, $\pi(a \mid s)$
- Value function: quantifies the expected future return from a state or state-action pair
Model-based agents have an explicit model. Model-free agents learn a policy and/or value function directly without building one.

Active Recall Questions

**Q1.** What are the three components an RL agent may have? Does an agent need all three?
Model ($P$ and $R$), policy ($\pi$), and value function ($V$ or $Q$). No, an agent can operate with any subset of these. For example, a model-free agent has no model but may have both a policy and a value function.

**Q2.** What is the difference between a deterministic and a stochastic policy? Write both mathematically.
A deterministic policy maps each state to a single action: $\pi(s) = a$. A stochastic policy maps each state to a distribution over actions: $\pi(a \mid s) = p(a_t = a \mid s_t = s)$.

**Q3.** What is the key distinction between a model-based and a model-free agent?
A model-based agent has an explicit representation of the transition dynamics $P$ and reward function $R$, which it uses for planning. A model-free agent does not learn or use a model; it learns a policy and/or value function directly from experience.
Lecture 2: MDPs, Policy Iteration, and Value Iteration
Markov Decision Process (MDP)
An MDP is an MRP with actions. It is a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$ where:
- $\mathcal{A}$ is a finite set of actions
- $P(s' \mid s, a)$ is the transition model conditioned on action $a$
- $R(s, a)$ is the reward for taking action $a$ in state $s$
MDP + policy = MRP. Given a policy $\pi(a \mid s)$, the MDP reduces to an MRP with:

$$R^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) R(s, a), \qquad P^\pi(s' \mid s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) P(s' \mid s, a)$$

This means all MRP techniques carry over directly for policy evaluation.

Active Recall Questions

**Q1.** Write the full tuple definition of an MDP. What does each element represent?
$(\mathcal{S}, \mathcal{A}, P, R, \gamma)$: state space, action space, transition dynamics $P(s' \mid s, a)$, reward function $R(s, a)$, discount factor $\gamma$.

**Q2.** How does fixing a policy transform an MDP into an MRP? Write the induced $R^\pi$ and $P^\pi$.
$R^\pi(s) = \sum_a \pi(a \mid s) R(s, a)$ and $P^\pi(s' \mid s) = \sum_a \pi(a \mid s) P(s' \mid s, a)$. The action is averaged out according to the policy, leaving a pure Markov chain with rewards.

**Q3.** Why is the MDP-to-MRP reduction useful?
It means that once we have a fixed policy, we can evaluate it using all the MRP machinery (Bellman equation, iterative DP, analytic solution), without needing new algorithms.
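The MDP-to-MRP reduction is a couple of lines of array arithmetic. A sketch with invented numbers; the array shapes `P[a, s, s']`, `R[s, a]`, and `pi[s, a]` are my own convention, not from the course:

```python
import numpy as np

# Toy 2-state, 2-action MDP; all numbers are invented for illustration.
# P[a, s, s'] = P(s' | s, a); R[s, a]; pi[s, a] = pi(a | s).
P = np.array([
    [[0.8, 0.2], [0.3, 0.7]],   # dynamics under action 0
    [[0.1, 0.9], [0.6, 0.4]],   # dynamics under action 1
])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
pi = np.array([[0.5, 0.5],
               [0.2, 0.8]])

# Induced MRP: average the action out under the policy.
# R_pi(s) = sum_a pi(a|s) R(s,a);  P_pi(s'|s) = sum_a pi(a|s) P(s'|s,a)
R_pi = (pi * R).sum(axis=1)
P_pi = np.einsum('sa,ast->st', pi, P)

# P_pi is a valid transition matrix: each row still sums to 1.
assert np.allclose(P_pi.sum(axis=1), 1.0)
```

From here, `R_pi` and `P_pi` can be fed straight into the MRP value-computation methods from Lecture 1.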
Value Functions in an MDP
With actions in the picture, there are two distinct value functions.
State-value function $V^\pi(s)$. The expected return starting from state $s$ and following policy $\pi$:

$$V^\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s]$$

Action-value function $Q^\pi(s, a)$. The expected return starting from state $s$, taking action $a$, then following policy $\pi$ thereafter:

$$Q^\pi(s, a) = \mathbb{E}_\pi[G_t \mid s_t = s, a_t = a]$$

Interdependence of $V^\pi$ and $Q^\pi$
$V^\pi$ and $Q^\pi$ are intrinsically linked. Each can be written in terms of the other:
| Function | Defined using | Equation | Meaning |
|---|---|---|---|
| $V^\pi(s)$ | Policy and $Q^\pi$ | $V^\pi(s) = \sum_a \pi(a \mid s) Q^\pi(s, a)$ | Expected $Q$-value, weighted by the policy's action probabilities |
| $Q^\pi(s, a)$ | Dynamics and $V^\pi$ | $Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^\pi(s')$ | Immediate reward plus discounted expected value of the next state |
Active Recall Questions
**Q1.** What is the difference between $V^\pi(s)$ and $Q^\pi(s, a)$?
$V^\pi(s)$ is the expected return from state $s$ when following $\pi$ from the very start. $Q^\pi(s, a)$ is the expected return when you take a specific action $a$ first, then follow $\pi$. $V^\pi$ averages over actions; $Q^\pi$ conditions on a specific action.

**Q2.** Express $V^\pi$ in terms of $Q^\pi$.
$V^\pi(s) = \sum_a \pi(a \mid s) Q^\pi(s, a)$. The state value is the policy-weighted average of the action values.

**Q3.** Express $Q^\pi$ in terms of $V^\pi$.
$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^\pi(s')$. Take action $a$, collect the immediate reward, then add the discounted value of the resulting state.

**Q4.** Why is the $V^\pi$/$Q^\pi$ interdependence important in practice?
It lets you move fluidly between the two representations. Policy improvement uses $Q^\pi$ to identify better actions. Policy evaluation computes $V^\pi$. Being able to convert between them in both directions is essential for implementing both algorithms.
Bellman Equations for $V^\pi$ and $Q^\pi$
Substituting the interdependence relations into each other gives the full Bellman expectation equations:

$$V^\pi(s) = \sum_{a} \pi(a \mid s) \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^\pi(s') \right]$$

$$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \sum_{a'} \pi(a' \mid s') Q^\pi(s', a')$$

Active Recall Questions

**Q1.** Write the Bellman expectation equation for $V^\pi$ and explain each term.
$V^\pi(s) = \sum_a \pi(a \mid s) \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^\pi(s') \right]$. The first term in the bracket is the immediate reward. The second term is the discounted value of the state we land in, averaged over transitions and the policy.

**Q2.** What is the key structural insight that the Bellman equations encode?
The value of a state (or state-action pair) right now can be broken into two parts: what you get immediately, and what you expect to get from here onwards. This recursive structure is what makes dynamic programming possible.
Policy Evaluation (Iterative)
Initialize $V_0^\pi(s) = 0$ for all $s$, then apply the Bellman expectation backup repeatedly:

$$V_{k+1}^\pi(s) = \sum_{a} \pi(a \mid s) \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V_k^\pi(s') \right]$$

For a deterministic policy this simplifies to:

$$V_{k+1}^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) V_k^\pi(s')$$

Continue until $\max_s |V_{k+1}^\pi(s) - V_k^\pi(s)| < \epsilon$ for some small threshold $\epsilon$.

Active Recall Questions

**Q1.** Write the iterative policy evaluation update. What does each term do?
$V_{k+1}^\pi(s) = \sum_a \pi(a \mid s) \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V_k^\pi(s') \right]$. The outer sum averages over actions according to the policy. The inner bracket adds the immediate reward to the discounted estimated value of the next state.

**Q2.** How does the update simplify for a deterministic policy?
Since $\pi$ picks a single action, the sum over $a$ disappears: $V_{k+1}^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) V_k^\pi(s')$.

**Q3.** What is policy evaluation computing, and why isn't it the same as finding the optimal policy?
Policy evaluation computes $V^\pi$, the value of a specific fixed policy $\pi$. It tells you how good $\pi$ is, but it does not improve it. Finding the optimal policy requires additionally doing policy improvement (as in policy iteration) or directly optimizing (as in value iteration).
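The Bellman expectation backup translates almost line for line into NumPy. A minimal sketch; the array shapes `P[a, s, s']`, `R[s, a]`, and `pi[s, a]` are my own conventions, and the one-state sanity check at the end is invented:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Iterative policy evaluation via the Bellman expectation backup.
    P[a, s, s'] = P(s' | s, a); R[s, a]; pi[s, a] = pi(a | s)."""
    V = np.zeros(R.shape[0])
    while True:
        # Q_k(s, a) = R(s, a) + gamma * sum_{s'} P(s'|s,a) V_k(s')
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        # V_{k+1}(s) = sum_a pi(a|s) Q_k(s, a)
        V_new = (pi * Q).sum(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Sanity check on a one-state, one-action MDP: V = R / (1 - gamma).
P1 = np.ones((1, 1, 1))
R1 = np.array([[1.0]])
pi1 = np.array([[1.0]])
V1 = policy_evaluation(P1, R1, pi1, gamma=0.9)
assert abs(V1[0] - 10.0) < 1e-3
```

The stopping test is exactly the $\max_s |V_{k+1}^\pi(s) - V_k^\pi(s)| < \epsilon$ criterion from above.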
MDP Control
The goal of control is to find the optimal policy:

$$\pi^*(s) = \arg\max_{\pi} V^\pi(s)$$

Key facts about the optimal policy in an infinite-horizon MDP:
- It is deterministic: randomness is never needed at optimality
- It is stationary: the same action is prescribed for a given state regardless of the timestep
- The optimal value function is unique, but the optimal policy is not necessarily (ties are possible)
The number of deterministic policies is $|\mathcal{A}|^{|\mathcal{S}|}$, so exhaustive search is infeasible for all but tiny problems.

Active Recall Questions

**Q1.** State three properties of the optimal policy in an infinite-horizon MDP.
It is deterministic, stationary (time-independent), and there exists a unique optimal value function (though there may be multiple optimal policies achieving it).

**Q2.** Why is the optimal policy stationary in an infinite-horizon MDP but not in a finite-horizon one?
In the infinite-horizon case, the remaining time is always the same (infinite), so the same decision rule applies at every step. In the finite-horizon case, the number of steps remaining decreases over time, so the optimal action in a state depends on how much time is left.

**Q3.** Why is exhaustive policy search not feasible?
The number of deterministic policies is $|\mathcal{A}|^{|\mathcal{S}|}$, which grows exponentially in the state space. Even for modest problem sizes this is astronomically large.
Policy Iteration
Policy Iteration (PI) discovers the optimal policy by alternating between exact policy evaluation and greedy policy improvement.
Step 1: Initialization. Initialize $\pi_0(s)$ and $V_0(s)$ arbitrarily for all states. Set a small convergence threshold $\epsilon$.
Step 2: Policy Evaluation (inner loop). Repeat until $\max_s |V_{k+1}^{\pi_i}(s) - V_k^{\pi_i}(s)| < \epsilon$:

$$V_{k+1}^{\pi_i}(s) = R(s, \pi_i(s)) + \gamma \sum_{s'} P(s' \mid s, \pi_i(s)) V_k^{\pi_i}(s')$$

Step 3: Policy Improvement. For each state $s$:

$$\pi_{i+1}(s) = \arg\max_{a} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^{\pi_i}(s') \right]$$

If the policy changed for any state, return to Step 2. Otherwise, stop and return $\pi_{i+1}$ and $V^{\pi_i}$.
The Policy Improvement Theorem
Theorem. Given two deterministic policies $\pi$ and $\pi'$, if for all $s$:

$$Q^\pi(s, \pi'(s)) \geq V^\pi(s)$$

then $\pi'$ is globally at least as good:

$$V^{\pi'}(s) \geq V^\pi(s) \quad \text{for all } s$$

Proof by unrolling the Bellman equation:
Assumption: $Q^\pi(s, \pi'(s)) \geq V^\pi(s)$ for all $s$.
Expand $Q^\pi$ at $s$:

$$V^\pi(s) \leq Q^\pi(s, \pi'(s)) = R(s, \pi'(s)) + \gamma \sum_{s'} P(s' \mid s, \pi'(s)) V^\pi(s')$$

- Apply the inequality again at the next state ($V^\pi(s') \leq Q^\pi(s', \pi'(s'))$):

$$V^\pi(s) \leq R(s, \pi'(s)) + \gamma \sum_{s'} P(s' \mid s, \pi'(s)) \left[ R(s', \pi'(s')) + \gamma \sum_{s''} P(s'' \mid s', \pi'(s')) V^\pi(s'') \right]$$

- Continue unrolling for $k$ steps: the right-hand side becomes the expected discounted reward from following $\pi'$ for $k$ steps, plus a remainder term $\gamma^k \, \mathbb{E}_{\pi'}[V^\pi(s_{t+k})]$.
- Take $k \to \infty$. Because $\gamma < 1$, the remainder $\gamma^k \, \mathbb{E}_{\pi'}[V^\pi(s_{t+k})] \to 0$, leaving $V^\pi(s) \leq V^{\pi'}(s)$.
Convergence of PI. If the policy does not change after an improvement step, it can never change again (since $\pi_{i+1} = \pi_i$ implies $V^{\pi_{i+1}} = V^{\pi_i}$, so the next improvement step is computed from the same values). Because there are at most $|\mathcal{A}|^{|\mathcal{S}|}$ distinct deterministic policies and each step is strictly improving, policy iteration always terminates.

Active Recall Questions

**Q1.** Describe the two phases of policy iteration in plain words.
First, policy evaluation: run the Bellman expectation update iteratively until $V^{\pi_i}$ converges for the current policy. Second, policy improvement: for each state, greedily pick the action that maximizes $Q^{\pi_i}(s, a)$ under the current value estimates. Repeat until the policy stops changing.

**Q2.** State the Policy Improvement Theorem formally.
If $Q^\pi(s, \pi'(s)) \geq V^\pi(s)$ for all $s$, then $V^{\pi'}(s) \geq V^\pi(s)$ for all $s$.

**Q3.** What is the key step in the proof? Why does the $\gamma^k$ term vanish?
The key step is repeatedly substituting the inequality at each successive state, unrolling the Bellman equation. The remainder $\gamma^k \, \mathbb{E}[V^\pi(s_{t+k})]$ vanishes as $k \to \infty$ because $\gamma < 1$ causes it to decay to zero exponentially.

**Q4.** Why does policy iteration always converge in finite time?
Each iteration either strictly improves the policy or leaves it unchanged. Once unchanged, it can never change again. Since the total number of distinct deterministic policies is finite ($|\mathcal{A}|^{|\mathcal{S}|}$) and the algorithm never revisits a suboptimal policy, it must terminate.

**Q5.** If policy iteration stops because the policy did not change, what can you conclude?
The current policy is optimal. The Bellman optimality condition holds: $V^{\pi_i}(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^{\pi_i}(s') \right]$ for all $s$, which means $\pi_i = \pi^*$.
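The whole evaluate-improve loop fits in a short function. A sketch under my own conventions (shapes `P[a, s, s']`, `R[s, a]`; the evaluation step uses the exact linear solve rather than an inner iterative loop, and the tiny test MDP at the bottom is invented):

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Policy iteration: exact policy evaluation + greedy improvement.
    P[a, s, s'] = P(s' | s, a); R[s, a]; returns (policy, V)."""
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)          # deterministic policy s -> a
    while True:
        # Evaluation: solve V = R_pi + gamma P_pi V for the current policy.
        idx = np.arange(n_states)
        R_pi, P_pi = R[idx, pi], P[pi, idx]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Improvement: act greedily with respect to Q^{pi_i}.
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):          # policy stable => optimal
            return pi, V
        pi = pi_new

# Tiny made-up MDP where action 1 strictly dominates in every state.
P = np.array([[[0.5, 0.5], [0.5, 0.5]],
              [[0.5, 0.5], [0.5, 0.5]]])
R = np.array([[0.0, 1.0],
              [0.0, 1.0]])
pi_star, V_star = policy_iteration(P, R, gamma=0.9)
assert list(pi_star) == [1, 1] and np.allclose(V_star, 10.0)
```

Here $V^* = 1 / (1 - \gamma) = 10$ because the best action always pays reward 1.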
Value Iteration
Value Iteration (VI) merges evaluation and improvement into a single step by applying the Bellman optimality operator directly at each iteration.
Bellman Optimality Equations. These define the maximum expected return achievable by acting optimally:

$$V^*(s) = \max_{a} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^*(s') \right]$$

$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q^*(s', a')$$

Algorithm:
- Initialize $V_0(s) = 0$ for all states. Set threshold $\epsilon$.
- For each state $s$, apply the Bellman optimality backup:

$$V_{k+1}(s) = \max_{a} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V_k(s') \right]$$

- Compute $\Delta = \max_s |V_{k+1}(s) - V_k(s)|$. If $\Delta < \epsilon$, stop.
- Extract the optimal policy greedily:

$$\pi(s) = \arg\max_{a} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V_{k+1}(s') \right]$$

Equivalently in operator notation: $V_{k+1} = B V_k$, where $B$ is the Bellman backup operator.
Convergence of Value Iteration (Contraction Mapping)
VI converges because the Bellman backup operator $B$ is a $\gamma$-contraction mapping under the infinity norm $\|V\|_\infty = \max_s |V(s)|$.
Proof. For any two value functions $V$ and $V'$:

$$|BV(s) - BV'(s)| = \left| \max_{a} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V(s') \right] - \max_{a} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V'(s') \right] \right|$$

Apply $|\max_a f(a) - \max_a g(a)| \leq \max_a |f(a) - g(a)|$:

$$|BV(s) - BV'(s)| \leq \max_{a} \left| \gamma \sum_{s'} P(s' \mid s, a) \left( V(s') - V'(s') \right) \right|$$

Cancel the reward (identical in both), then bound $|V(s') - V'(s')|$ by the worst-case difference across all states:

$$|BV(s) - BV'(s)| \leq \gamma \max_{a} \sum_{s'} P(s' \mid s, a) \, \|V - V'\|_\infty = \gamma \|V - V'\|_\infty$$

Since this holds for all $s$:

$$\|BV - BV'\|_\infty \leq \gamma \|V - V'\|_\infty$$

Because $\gamma < 1$, the distance to $V^*$ shrinks by a factor of $\gamma$ at every iteration. Value iteration converges to the unique optimal value function $V^*$.

Active Recall Questions

**Q1.** Write the Bellman optimality equation for $V^*$. How does it differ from the Bellman expectation equation?
$V^*(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^*(s') \right]$. The difference is the $\max_a$ instead of $\sum_a \pi(a \mid s)$. Rather than averaging over the policy's action distribution, we take the best possible action.

**Q2.** What is a contraction mapping, and why does it guarantee convergence?
An operator $B$ is a contraction if $\|BV - BV'\| \leq \gamma \|V - V'\|$ with $\gamma < 1$. It shrinks the distance between any two inputs by a fixed factor each time it is applied. By the Banach fixed-point theorem, repeated application converges to a unique fixed point.

**Q3.** Sketch the key steps of the contraction proof for the Bellman operator.
(1) Expand $|BV(s) - BV'(s)|$ using the definition of $B$. (2) Apply $|\max_a f - \max_a g| \leq \max_a |f - g|$. (3) Cancel the reward (same in both). (4) Factor out $\gamma$ and bound $|V(s') - V'(s')|$ by $\|V - V'\|_\infty$.

**Q4.** Does the initialization of $V_0$ in value iteration affect what it converges to?
No. Because $B$ is a contraction, any initialization converges to the unique fixed point $V^*$. Different initializations may take different numbers of iterations, but the final answer is the same.
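Value iteration collapses evaluation and improvement into one backup. A sketch using my own shape conventions (`P[a, s, s']`, `R[s, a]`) and an invented two-state test MDP:

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-8):
    """Bellman optimality backups until the value change drops below eps,
    then greedy policy extraction. P[a, s, s'] = P(s' | s, a); R[s, a]."""
    V = np.zeros(R.shape[0])
    while True:
        # V_{k+1}(s) = max_a [ R(s,a) + gamma sum_{s'} P(s'|s,a) V_k(s') ]
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < eps:
            return Q.argmax(axis=1), V_new      # greedy policy, V ~ V*
        V = V_new

# Tiny made-up MDP where action 1 strictly dominates in every state.
P = np.array([[[0.5, 0.5], [0.5, 0.5]],
              [[0.5, 0.5], [0.5, 0.5]]])
R = np.array([[0.0, 1.0],
              [0.0, 1.0]])
pi_star, V_star = value_iteration(P, R, gamma=0.9)
assert list(pi_star) == [1, 1] and np.allclose(V_star, 10.0, atol=1e-5)
```

Note the single `max` backup inside the loop: that is the only difference from the policy evaluation sweep, which uses a policy-weighted sum instead.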
Finite Horizon Value Iteration
For a finite horizon $H$, the algorithm runs for exactly $H$ steps rather than until convergence:
- $V_k^*(s)$: optimal value with $k$ decisions remaining
- $\pi_k^*(s)$: optimal policy with $k$ decisions remaining
- Initialize $V_0^*(s) = 0$, iterate $k = 1$ to $H$
The optimal policy in a finite-horizon problem is non-stationary: $\pi_k^*(s)$ depends on how many steps remain, not just the current state. This is a fundamental difference from the infinite-horizon case.

Active Recall Questions

**Q1.** What does $V_k^*(s)$ represent in finite-horizon value iteration?
It is the optimal expected return from state $s$ when exactly $k$ decisions remain.

**Q2.** Why is the optimal policy non-stationary in the finite-horizon case?
With a finite horizon, the right action depends on how many steps are left. Near the end of an episode, the agent may prefer to exploit immediately; earlier in the episode, it may be worth taking actions that sacrifice short-term reward for longer-term gain. The remaining time is part of the decision context.
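The finite-horizon variant swaps the convergence test for a fixed number of backups and keeps one greedy policy per remaining-steps count. A sketch under my own shape conventions (`P[a, s, s']`, `R[s, a]`), with an invented test MDP:

```python
import numpy as np

def finite_horizon_vi(P, R, gamma, H):
    """Finite-horizon value iteration: exactly H backups, no convergence
    test. Returns the non-stationary policies pi*_1..pi*_H and V*_H.
    P[a, s, s'] = P(s' | s, a); R[s, a]."""
    V = np.zeros(R.shape[0])                    # V*_0 = 0: no decisions left
    policies = []
    for k in range(1, H + 1):
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        policies.append(Q.argmax(axis=1))       # pi*_k may depend on k
        V = Q.max(axis=1)
    return policies, V

# Made-up MDP whose best action always pays reward 1. With H decisions
# the optimal value is the truncated series 1 + gamma + ... + gamma^{H-1}.
P = np.array([[[0.5, 0.5], [0.5, 0.5]],
              [[0.5, 0.5], [0.5, 0.5]]])
R = np.array([[0.0, 1.0],
              [0.0, 1.0]])
policies, V_H = finite_horizon_vi(P, R, gamma=0.9, H=3)
assert np.allclose(V_H, 1 + 0.9 + 0.81)         # = 2.71
```

In this toy MDP the greedy action happens to be the same at every $k$; in general each `policies[k]` can differ, which is exactly the non-stationarity discussed above.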
Policy Iteration vs Value Iteration
| Feature | Policy Iteration | Value Iteration |
|---|---|---|
| Focus | Improving a specific policy | Finding the optimal value directly |
| Update rule | Bellman expectation, then $\arg\max$ | Bellman optimality ($\max$) at every step |
| Phases | Two distinct steps: evaluate then improve | One merged step using the $\max$ operator |
| Convergence | Monotonically improving; terminates in at most $\lvert\mathcal{A}\rvert^{\lvert\mathcal{S}\rvert}$ iterations | Contraction mapping; converges geometrically to $V^*$ |
| Cost per step | Higher (full policy evaluation per iteration) | Lower (one Bellman backup per state) |
Active Recall Questions
**Q1.** What is the fundamental difference in the update rule between PI and VI?
PI applies the Bellman expectation operator (weighted sum over $\pi$'s action distribution) for evaluation, then a separate $\arg\max$ for improvement. VI applies the Bellman optimality operator ($\max_a$) in a single step, merging both.

**Q2.** Which algorithm typically requires fewer outer iterations and why?
Policy iteration, because it fully evaluates the current policy before improving it. Each improvement step makes a globally informed update. Value iteration mixes partial evaluation and improvement, converging more slowly in terms of outer iterations but with cheaper per-step cost.

**Q3.** Both algorithms converge to $V^*$. What property of policy iteration makes this obvious? What property of value iteration makes this obvious?
For PI: monotonic improvement guarantees we never go backwards, and termination is finite because the policy space is finite. For VI: the Bellman operator is a contraction, so $V_k \to V^*$, and the greedy policy extracted from $V^*$ is optimal.
I will keep adding notes as I work through the rest of the course, along with my solutions to each assignment. More to come.