
CS234 Assignment 1, Part 1: Horizon, Discounting, and Reward Hacking

A few weeks ago I started working through Stanford's CS234 course on Reinforcement Learning. I have been meaning to get into RL properly for a while and decided that working through a real course with real problem sets was the best way to do it. This post covers my solutions to Problems 1 and 2 from Assignment 1. A follow-up post will cover Problems 3 and 4.

The full problem set is available on the CS234 course website.


Problem 1: How Horizon Shapes the Optimal Policy

This problem builds intuition for how the time horizon of an MDP shapes the optimal policy. The setup is a simple inventory management environment where the agent controls a stock level $s \in \{0, 1, \dots, 10\}$. Each sell action yields $+1$ reward and transitions $s \to s - 1$. Buying transitions $s \to s + 1$ with no reward, except for the final step $s = 9 \to s = 10$, which yields $+100$. The state $s = 10$ is terminal. The stock always starts at $s = 3$.

(a) Can the optimal policy both buy and sell?

Yes. With $H = 9$, the agent can sell once ($s = 3 \to s = 2$, earning $+1$), then buy eight consecutive times to reach $s = 10$ and collect the $+100$ bonus. The total reward is $101$.

Compare this to buying directly from $s = 3$, which takes only 7 steps and yields $+100$ total. The sell-first strategy is strictly better when the time budget allows it, and $H = 9$ is a concrete example where the optimal policy uses both actions.
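The arithmetic is easy to sanity-check with a brute-force search over action sequences. This is my own toy encoding of the MDP (`step` and `best_return` are names I made up, not the assignment's starter code):

```python
# Toy encoding of the inventory MDP from Problem 1 (my own, not course code).

def step(s, a):
    """Next state and reward for action a in {'sell', 'buy'} at stock level s."""
    if a == 'sell':
        return (s - 1, 1) if s > 0 else (0, 0)  # +1 per unit sold
    if s == 9:
        return 10, 100                          # final buy pays the bonus
    return min(s + 1, 10), 0                    # ordinary buy pays nothing

def best_return(s, h):
    """Max undiscounted return from state s with h steps remaining."""
    if h == 0 or s == 10:                       # horizon spent or terminal
        return 0
    return max(r + best_return(s2, h - 1)
               for a in ('sell', 'buy')
               for s2, r in [step(s, a)])

print(best_return(3, 9))  # 101: one sell plus the +100 bonus
print(best_return(3, 7))  # 100: just enough steps to buy straight up
```

With only two actions and $H \le 9$ the exhaustive recursion touches at most $2^9$ branches, so no memoization is needed.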

(b) Can we choose $\gamma$ so the agent never fully stocks?

Yes. Setting $\gamma = 0$ forces the agent to optimize only for immediate reward. Selling yields $+1$ now while buying yields $0$. So the agent sells until $s = 0$ and then stays there, since neither action produces immediate reward at $s = 0$. The inventory never reaches $s = 10$.

More formally, with $\gamma = 0$ the value function collapses to:

$$V^\pi(s) = \mathbb{E}[R_{t+1} \mid S_t = s]$$

which makes the $+100$ terminal reward completely invisible to the agent.
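To see this concretely, here is a purely myopic agent on my toy encoding of the MDP (again my own names, not course code). It picks whichever action has the higher immediate reward; at $s = 0$ both actions pay zero, and I break the tie toward sell, which is a no-op there:

```python
# A gamma = 0 agent on a toy encoding of the inventory MDP (names are mine).

def immediate_reward(s, a):
    if a == 'sell':
        return 1 if s > 0 else 0     # selling pays +1 while stock remains
    return 100 if s == 9 else 0      # buying pays only on the s = 9 -> 10 step

s, trajectory = 3, []
for _ in range(10):
    # ties (e.g. at s = 0) resolve to 'sell', which is a no-op there
    a = max(('sell', 'buy'), key=lambda act: immediate_reward(s, act))
    trajectory.append((s, a))
    s = max(s - 1, 0) if a == 'sell' else min(s + 1, 10)

print(trajectory[:4])  # [(3, 'sell'), (2, 'sell'), (1, 'sell'), (0, 'sell')]
```

The agent liquidates its stock in three steps and then idles at $s = 0$ forever; $s = 10$ is never reached.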

(c) Do the two formulations ever agree?

This part asks whether there exists a $\gamma$ and $H$ such that the optimal policy for the infinite-horizon discounted MDP matches the optimal policy for the finite-horizon undiscounted MDP.

Yes. Take $\gamma = 0$ and $H = 1$, starting from $s = 3$.

With $H = 1$, the agent has one step. Selling gives $+1$, buying gives $0$. The optimal action is sell.

With $\gamma = 0$, the agent only sees immediate reward. The optimal action is also sell.

Both formulations agree: $\pi^*(s=3) = \text{sell}$.

(d) Does this always hold for any $H$?

No. A matching $\gamma$ is not guaranteed to exist for every $H$.

The core issue is that infinite-horizon MDPs produce stationary optimal policies: $\pi^*: \mathcal{S} \to \mathcal{A}$ maps states to actions with no dependence on time. Finite-horizon MDPs produce non-stationary policies $\pi^*_t: \mathcal{S} \to \mathcal{A}$ that depend on how many steps remain.

For large $H$, the finite-horizon optimal policy is explicitly time-dependent. Early in the episode, with many steps remaining, the agent should buy to eventually collect the $+100$ bonus. Late in the episode, with few steps left, it should sell to collect $+1$ rewards before the horizon expires. No fixed $\gamma$ can replicate this time-varying behavior via a stationary policy for a general $H$.
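The non-stationarity is easy to exhibit with a finite-horizon DP on my toy encoding of the inventory MDP (my own sketch, not course code): the optimal action at the same state $s = 3$ flips depending on how many steps remain.

```python
# Finite-horizon DP on a toy encoding of the inventory MDP (names are mine).

def step(s, a):
    if a == 'sell':
        return (s - 1, 1) if s > 0 else (0, 0)
    if s == 9:
        return 10, 100
    return min(s + 1, 10), 0

def best(s, h):
    """Return (max return, greedy action) from s with h steps remaining."""
    if h == 0 or s == 10:
        return 0, None
    returns = {}
    for a in ('sell', 'buy'):
        s2, r = step(s, a)
        returns[a] = r + best(s2, h - 1)[0]
    a_star = max(returns, key=returns.get)
    return returns[a_star], a_star

print(best(3, 7))  # (100, 'buy'): enough steps left to chase the bonus
print(best(3, 3))  # (3, 'sell'):  bonus unreachable, so harvest the +1s
```

Same state, different optimal action: the policy necessarily depends on the remaining horizon, which no stationary policy can express.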


Problem 2: When Proxy Rewards Go Wrong

This problem explores how a well-intentioned proxy reward can lead to an agent that is technically optimal but completely misaligned with human intent.

The setup: one AI-controlled car (red) navigates alongside many human-driven cars on a highway. The true goal is to minimize mean commute time. The proxy reward used instead is to maximize mean velocity across all cars.

(a) Why does the optimal policy park and not merge?

Merging requires nearby human-driven cars to slow down to create a gap. This reduction in velocity directly lowers the proxy reward. The AI car discovers that staying parked is the better strategy: the human-driven cars maintain their speed unimpeded, and mean velocity stays high.

Let $\bar{v}$ denote mean velocity across all $n$ cars. The proxy reward is $R = \bar{v}$. At the moment of merging, the temporary slowdown caused by surrounding cars braking gives:

$$\bar{v}_{\text{merge}} < \bar{v}_{\text{parked}}$$

So the agent is not behaving irrationally. It is just that $R = \bar{v}$ does not capture what we actually want.
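Some made-up numbers illustrate the inequality. Suppose ten human cars cruise at 30 m/s and merging forces three of them to brake (all figures here are mine, purely for illustration):

```python
# Toy numbers (mine, not from the assignment) for the proxy comparison:
# ten human cars at 30 m/s plus the AI car.

human = [30.0] * 10

# Parked: the AI car contributes v = 0, but nobody else slows down.
parked = human + [0.0]

# Merging: the AI car moves at 20 m/s, but three cars brake to 15 m/s.
merging = [15.0] * 3 + [30.0] * 7 + [20.0]

def mean(vs):
    return sum(vs) / len(vs)

print(round(mean(parked), 2), mean(merging))  # 27.27 25.0 -- parked wins
```

Under the proxy, even a modest slowdown spread over a few neighbors outweighs the AI car's own contribution to $\bar{v}$, so staying parked is optimal.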

(b) A better reward function

One alternative is to reward the AI car directly for merging promptly. Define:

$$R = -t_{\text{merge}}$$

where $t_{\text{merge}}$ is the number of timesteps before the car successfully enters the highway. If the car never merges, $t_{\text{merge}} \to \infty$ and $R \to -\infty$, which forces the agent to merge eventually.

To address safety, a crash state can be introduced with a large negative reward $R_{\text{crash}} = -C$ for some large constant $C$. The combined reward:

$$R = -t_{\text{merge}} + R_{\text{crash}} \cdot \mathbf{1}[\text{crash}]$$

captures both the goal (merge) and the safety constraint (do it without causing a collision), while sidestepping the mean velocity proxy entirely.
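As a quick sketch, the combined reward might look like the following, where `t_merge`, `crashed`, and the penalty scale `C` are hypothetical names and values of my own choosing:

```python
# Sketch of the combined reward from part (b); names and constants are mine.

C = 1000.0  # crash penalty; just needs to dwarf any plausible t_merge

def reward(t_merge, crashed):
    """-t_merge for slow merging, minus C if a collision occurred."""
    return -t_merge - (C if crashed else 0.0)

print(reward(5, False))    # -5.0: merged quickly, no crash
print(reward(5, True))     # -1005.0: crashing dominates everything else
print(reward(200, False))  # -200.0: dawdling is penalized linearly
```

The only real design choice is the scale of $C$: it must exceed whatever reward the agent could salvage by crashing early, otherwise the safety term fails to bind.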


What's Next

Problems 3 and 4 go considerably deeper. Problem 3 covers Bellman residuals and performance bounds, proving properties of the Bellman backup operator and deriving guarantees on policy quality from an arbitrary value function $V$. Problem 4 is an implementation task: building value iteration and policy iteration from scratch on the RiverSwim MDP.

I will cover those in the next post. I also have a separate post in draft on the RL theory background covering value functions, Bellman equations, and the convergence proofs for both algorithms. That one is a study reference rather than a problem walkthrough.