rakshit

Deep RL Diary: Implementing Policy Gradients on CartPole-v1 from Scratch

In reinforcement learning, policy gradient methods directly parameterize the policy $\pi_\theta(a|s)$ and optimize it to maximize the expected cumulative reward $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$. While mathematically elegant, policy gradient methods are notoriously sensitive to hyperparameters, prone to high-variance gradient estimates, and vulnerable to training instability.

This post documents how I implemented these algorithms on CartPole-v1 in PyTorch. I started from basic episodic REINFORCE, scaled up to vectorized environments with A2C + GAE, and then implemented Proximal Policy Optimization (PPO). Along the way, a series of subtle implementation traps completely broke learning in the vectorized versions until I tracked them down.

Here is the step-by-step log of what worked, what failed, and why.


Step 1: The Basics — REINFORCE (No Baseline)

The foundation of policy optimization is the Policy Gradient Theorem. The gradient of the expected return is given by:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) G_t \right]$$

where $G_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k}$ is the Monte Carlo return from step $t$.
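To make the definition of $G_t$ concrete, here is a minimal sketch of the backward recursion $G_t = r_t + \gamma G_{t+1}$ in plain Python. The helper name `discounted_returns` and the three-step reward sequence are illustrative only.

```python
def discounted_returns(rewards: list[float], gamma: float) -> list[float]:
    """Return G_t for every step t via the recursion G_t = r_t + gamma * G_{t+1}."""
    returns: list[float] = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


# Three steps of CartPole-style reward (+1 per step) with gamma = 0.99
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.99))
```

Walking backwards makes the whole computation linear in the episode length, instead of re-summing the tail of the episode for every step.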

In my first implementation, I used a vectorized environment to parallelize rollouts, but updated the policy after collecting full episodes. Here is the code for the basic REINFORCE algorithm without baseline:

def run_reinforce_no_baseline(
    episodes: int = 1000,
    seed: int = 0,
    n_envs: int = N_ENVS,
    gamma: float = 0.99,
    lr: float = 0.01,
    save_path: str = None
) -> list[float]:
    print("  Running REINFORCE (No Baseline)...")
    envs = gym.vector.SyncVectorEnv([lambda: gym.make('CartPole-v1') for _ in range(n_envs)])
    obs, _ = envs.reset(seed=seed)

    state_size: int = envs.single_observation_space.shape[0]
    action_size: int = envs.single_action_space.n

    policy = PolicyNetwork(state_size, action_size).to(device)
    optimizer = optim.Adam(policy.parameters(), lr=lr)

    env_states:  list[list[np.ndarray]] = [[] for _ in range(n_envs)]
    env_actions: list[list[int]]        = [[] for _ in range(n_envs)]
    env_rewards: list[list[float]]      = [[] for _ in range(n_envs)]
    completed: list[float] = []

    while len(completed) < episodes:
        with t.no_grad():
            obs_t: Float[Tensor, "env state"] = t.tensor(obs, dtype=t.float32, device=device)
            probs: Float[Tensor, "env action"] = policy(obs_t)
            m = t.distributions.Categorical(probs)
            action: Int[Tensor, "env"] = m.sample()

        for i in range(n_envs):
            env_states[i].append(obs[i].copy())
            env_actions[i].append(action[i].item())

        obs, reward, terminated, truncated, _ = envs.step(action.cpu().numpy())
        done = terminated | truncated

        for i in range(n_envs):
            env_rewards[i].append(reward[i])

        done_indices = np.where(done)[0]
        if len(done_indices) > 0:
            all_losses: list[Float[Tensor, ""]] = []

            for i in done_indices:
                ep_rewards = env_rewards[i]
                completed.append(sum(ep_rewards))

                returns: list[float] = []
                G = 0.0
                for r in reversed(ep_rewards):
                    G = r + gamma * G
                    returns.append(G)
                returns.reverse()
                G_t: Float[Tensor, "time"] = t.tensor(returns, dtype=t.float32, device=device)
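                # Normalize returns to scale gradients. Note that subtracting the
                # per-episode mean already acts like a crude constant baseline, so
                # "no baseline" here really means "no learned baseline".
                G_t = (G_t - G_t.mean()) / (G_t.std() + 1e-8)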

                ep_obs: Float[Tensor, "time state"] = t.tensor(
                    np.array(env_states[i]), dtype=t.float32, device=device
                )
                ep_act: Int[Tensor, "time"] = t.tensor(env_actions[i], dtype=t.long, device=device)
                ep_probs: Float[Tensor, "time action"] = policy(ep_obs)
                ep_m = t.distributions.Categorical(ep_probs)
                log_pi: Float[Tensor, "time"] = ep_m.log_prob(ep_act)

                all_losses.append(-reduce(G_t * log_pi, 'time -> ()', 'sum'))

                env_states[i] = []
                env_actions[i] = []
                env_rewards[i] = []

            total_loss: Float[Tensor, ""] = t.stack(all_losses).mean()
            optimizer.zero_grad()
            total_loss.backward()
            optimizer.step()

    envs.close()
    if save_path:
        os.makedirs(os.path.dirname(save_path), exist_ok=True)
        t.save(policy.state_dict(), save_path)
    return completed[:episodes]
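The listing above relies on several names that this post does not define: the imports behind `t`, `optim`, `gym`, `reduce`, `Float` and `Int`, plus `device`, `N_ENVS` and `PolicyNetwork` (later listings also use `ValueNetwork`, `N_STEPS` and `ortho_init`). The following is a minimal sketch of what they could look like, not the original definitions. The assumptions: `PolicyNetwork` returns action probabilities (the listings pass its output to `Categorical(probs)`), `ValueNetwork` returns a `(batch, 1)` tensor (the listings apply `rearrange(..., 'time 1 -> time')` to it), `N_ENVS = 16` and `N_STEPS = 32` follow the values quoted elsewhere in the post, and the hidden width, activation, and the body of `ortho_init` are arbitrary choices.

```python
import torch as t
import torch.nn as nn

# The listings in this post additionally assume imports along these lines
# (left as comments so that this sketch only needs PyTorch to run):
# import os
# import numpy as np
# import gymnasium as gym            # or `import gym` with the 5-tuple step API
# import torch.nn.functional as F
# import torch.optim as optim
# from torch import Tensor
# from einops import rearrange, reduce
# from jaxtyping import Float, Int

N_ENVS = 16    # assumption: N = 16 parallel environments
N_STEPS = 32   # assumption: T = 32 rollout steps
device = t.device("cuda" if t.cuda.is_available() else "cpu")


class PolicyNetwork(nn.Module):
    """Maps a state to a probability distribution over discrete actions."""

    def __init__(self, state_size: int, action_size: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden), nn.Tanh(),
            nn.Linear(hidden, action_size),
        )

    def forward(self, x: t.Tensor) -> t.Tensor:
        return t.softmax(self.net(x), dim=-1)


class ValueNetwork(nn.Module):
    """Maps a state to a scalar value estimate of shape (batch, 1)."""

    def __init__(self, state_size: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: t.Tensor) -> t.Tensor:
        return self.net(x)


def ortho_init(*modules: nn.Module, gain: float = 2 ** 0.5) -> None:
    """Hypothetical helper: orthogonal weights, zero biases for every Linear layer."""
    for module in modules:
        for layer in module.modules():
            if isinstance(layer, nn.Linear):
                nn.init.orthogonal_(layer.weight, gain=gain)
                nn.init.zeros_(layer.bias)
```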

Visualizing Performance: Random vs. REINFORCE (No Baseline)

Before training, a Random Agent wiggles aimlessly and fails in under 15 steps. Once trained with REINFORCE (No Baseline), the agent manages to balance the pole for the full duration of 500 steps, though with visible jittering.

[Figure: Random Agent (failed) vs. REINFORCE No Baseline (balanced)]

Step 2: Variance Reduction — REINFORCE (With Baseline)

Because $G_t$ is computed from a single full rollout trajectory, basic REINFORCE is highly sensitive to the randomness of that trajectory (sampled actions as well as environment stochasticity). To reduce variance without introducing bias, we subtract a state-dependent baseline $V_\phi(s_t)$ from the return. The advantage estimator becomes $A_t = G_t - V_\phi(s_t)$.

Mathematically, subtracting any baseline $V(s)$ that does not depend on the action $a_t$ keeps the gradient unbiased because the expected value of the baseline gradient is zero:

$$\mathbb{E}_{a_t \sim \pi_\theta} [\nabla_\theta \log \pi_\theta(a_t|s_t) V(s_t)] = \sum_{a} \pi_\theta(a|s_t) \nabla_\theta \log \pi_\theta(a|s_t) V(s_t) = \sum_{a} \nabla_\theta \pi_\theta(a|s_t) V(s_t) = V(s_t) \nabla_\theta \sum_{a} \pi_\theta(a|s_t) = V(s_t) \nabla_\theta (1) = 0$$
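A quick numeric sanity check of this identity, as a sketch under assumptions of my choosing: a two-action softmax policy parameterized directly by its logits $z$, with arbitrary logits and an arbitrary baseline value. It uses the closed-form softmax score $\partial \log \pi(a) / \partial z_j = \mathbb{1}[a = j] - \pi(j)$; the function names are illustrative.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_baseline_gradient(logits: list[float], baseline: float) -> list[float]:
    """E_{a ~ pi}[ d log pi(a) / d z_j * baseline ] for every logit z_j."""
    pi = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    for a in range(n):          # expectation over actions
        for j in range(n):      # gradient with respect to each logit
            score = (1.0 if a == j else 0.0) - pi[j]
            grad[j] += pi[a] * score * baseline
    return grad


# Arbitrary logits and baseline value; every entry is zero up to float rounding
print(expected_baseline_gradient([0.3, -1.2], baseline=42.0))
```

The baseline can be as large as you like: because it multiplies a score function whose expectation is zero, it changes the variance of the estimate but not its mean.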

In code, I trained a concurrent critic network $V_\phi(s)$ using mean-squared error to predict the returns:

def run_reinforce_with_baseline(
    episodes: int = 1000,
    seed: int = 0,
    n_envs: int = N_ENVS,
    gamma: float = 0.99,
    lr_policy: float = 0.01,
    lr_value: float = 0.01,
    save_path: str = None
) -> list[float]:
    print("  Running REINFORCE (With Baseline)...")
    envs = gym.vector.SyncVectorEnv([lambda: gym.make('CartPole-v1') for _ in range(n_envs)])
    obs, _ = envs.reset(seed=seed)

    state_size: int = envs.single_observation_space.shape[0]
    action_size: int = envs.single_action_space.n

    policy = PolicyNetwork(state_size, action_size).to(device)
    value_net = ValueNetwork(state_size).to(device)
    # Separate parameter groups so that lr_value actually takes effect
    optimizer = optim.Adam([
        {"params": policy.parameters(), "lr": lr_policy},
        {"params": value_net.parameters(), "lr": lr_value},
    ])

    env_states:  list[list[np.ndarray]] = [[] for _ in range(n_envs)]
    env_actions: list[list[int]]        = [[] for _ in range(n_envs)]
    env_rewards: list[list[float]]      = [[] for _ in range(n_envs)]
    completed: list[float] = []

    while len(completed) < episodes:
        with t.no_grad():
            obs_t: Float[Tensor, "env state"]  = t.tensor(obs, dtype=t.float32, device=device)
            probs: Float[Tensor, "env action"] = policy(obs_t)
            m = t.distributions.Categorical(probs)
            action: Int[Tensor, "env"] = m.sample()

        for i in range(n_envs):
            env_states[i].append(obs[i].copy())
            env_actions[i].append(action[i].item())

        obs, reward, terminated, truncated, _ = envs.step(action.cpu().numpy())
        done = terminated | truncated

        for i in range(n_envs):
            env_rewards[i].append(reward[i])

        done_indices = np.where(done)[0]
        if len(done_indices) > 0:
            all_losses: list[Float[Tensor, ""]] = []

            for i in done_indices:
                ep_rewards = env_rewards[i]
                completed.append(sum(ep_rewards))

                returns_list: list[float] = []
                G = 0.0
                for r in reversed(ep_rewards):
                    G = r + gamma * G
                    returns_list.append(G)
                returns_list.reverse()
                G_t: Float[Tensor, "time"] = t.tensor(returns_list, dtype=t.float32, device=device)

                ep_obs: Float[Tensor, "time state"] = t.tensor(
                    np.array(env_states[i]), dtype=t.float32, device=device
                )
                ep_act: Int[Tensor, "time"] = t.tensor(env_actions[i], dtype=t.long, device=device)

                ep_probs: Float[Tensor, "time action"] = policy(ep_obs)
                ep_m = t.distributions.Categorical(ep_probs)
                log_pi: Float[Tensor, "time"] = ep_m.log_prob(ep_act)
                V_t:    Float[Tensor, "time"] = rearrange(value_net(ep_obs), 'time 1 -> time')

                A_t: Float[Tensor, "time"] = G_t - V_t.detach()

                policy_loss: Float[Tensor, ""] = -reduce(A_t * log_pi, 'time -> ()', 'sum')
                value_loss:  Float[Tensor, ""] = reduce((V_t - G_t) ** 2, 'time -> ()', 'mean')
                all_losses.append(policy_loss + value_loss)

                env_states[i] = []
                env_actions[i] = []
                env_rewards[i] = []

            total_loss: Float[Tensor, ""] = t.stack(all_losses).mean()
            optimizer.zero_grad()
            total_loss.backward()
            optimizer.step()

    envs.close()
    if save_path:
        os.makedirs(os.path.dirname(save_path), exist_ok=True)
        t.save(policy.state_dict(), save_path)
    return completed[:episodes]

Visualizing Performance: REINFORCE (With Baseline)

With the baseline reducing gradient variance, training stabilizes, and the resulting policy balances the pole for 362 steps in evaluation.

[Figure: REINFORCE With Baseline]


Step 3: Scaling Up — Vectorized A2C + GAE

Instead of waiting for an entire episode to finish (Monte Carlo), online Actor-Critic methods update the policy using temporal difference (TD) learning. To balance bias and variance, we can compute Generalized Advantage Estimation (GAE) over a fixed rollout window of $T = 32$ steps in each of $N = 16$ parallel environments.

GAE introduces a mixing parameter $\lambda \in [0, 1]$:

$$\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}^V$$

where $\delta_t^V = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$ is the TD error.
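To see what $\lambda$ interpolates between, here is a small single-environment sketch in plain Python, with no episode boundaries and made-up rewards and values. With $\lambda = 0$ the estimate collapses to the one-step TD error; with $\lambda = 1$ the sum telescopes into the discounted return (bootstrapped from the final value) minus $V_\phi(s_t)$. The function name `gae_single_env` is illustrative.

```python
def gae_single_env(rewards, values, next_value, gamma, lmbda):
    """GAE for one environment without terminations (backward recursion)."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for step in reversed(range(len(rewards))):
        next_val = next_value if step == len(rewards) - 1 else values[step + 1]
        delta = rewards[step] + gamma * next_val - values[step]  # TD error
        gae = delta + gamma * lmbda * gae
        advantages[step] = gae
    return advantages


# Illustrative numbers only
rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.4, 0.3]
next_value = 0.2
gamma = 0.99

td_only = gae_single_env(rewards, values, next_value, gamma, lmbda=0.0)
monte_carlo = gae_single_env(rewards, values, next_value, gamma, lmbda=1.0)
print("lambda=0:", td_only)
print("lambda=1:", monte_carlo)
```

Values of $\lambda$ in between weight the longer, noisier multi-step estimates geometrically less, which is the bias-variance dial referred to above.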

Here is the vectorized GAE implementation, followed by the main A2C training loop:

def compute_gae(
    rewards: Float[Tensor, "step env"],
    values: Float[Tensor, "step env"],
    dones: Float[Tensor, "step env"],
    next_value: Float[Tensor, "env"],
    gamma: float,
    lmbda: float,
) -> tuple[Float[Tensor, "step env"], Float[Tensor, "step env"]]:
    n_steps = rewards.shape[0]
    advantages: Float[Tensor, "step env"] = t.zeros_like(rewards)
    gae: Float[Tensor, "env"] = t.zeros_like(next_value)

    for step in reversed(range(n_steps)):
        next_val = next_value if step == n_steps - 1 else values[step + 1]
        not_done: Float[Tensor, "env"] = 1.0 - dones[step]
        delta: Float[Tensor, "env"] = rewards[step] + gamma * next_val * not_done - values[step]
        gae = delta + gamma * lmbda * not_done * gae
        advantages[step] = gae

    returns: Float[Tensor, "step env"] = advantages + values
    return advantages, returns


def run_a2c_gae(
    episodes: int = 1000,
    seed: int = 0,
    n_envs: int = N_ENVS,
    n_steps: int = N_STEPS,
    gamma: float = 0.99,
    lmbda: float = 0.95,
    lr: float = 0.01,
    save_path: str = None
) -> list[float]:
    print("  Running A2C + GAE (vectorized)...")
    envs = gym.vector.SyncVectorEnv([lambda: gym.make('CartPole-v1') for _ in range(n_envs)])
    obs, _ = envs.reset(seed=seed)

    state_size: int = envs.single_observation_space.shape[0]
    action_size: int = envs.single_action_space.n

    policy = PolicyNetwork(state_size, action_size).to(device)
    value_net = ValueNetwork(state_size).to(device)
    all_params = list(policy.parameters()) + list(value_net.parameters())
    optimizer = optim.Adam(all_params, lr=lr)

    running_rewards = np.zeros(n_envs)
    completed_rewards: list[float] = []

    while len(completed_rewards) < episodes:
        mb_obs     = t.zeros(n_steps, n_envs, state_size, device=device)
        mb_actions = t.zeros(n_steps, n_envs, dtype=t.long, device=device)
        mb_logp    = t.zeros(n_steps, n_envs, device=device)
        mb_rewards = t.zeros(n_steps, n_envs, device=device)
        mb_dones   = t.zeros(n_steps, n_envs, device=device)
        mb_values  = t.zeros(n_steps, n_envs, device=device)

        with t.no_grad():
            for step in range(n_steps):
                obs_t: Float[Tensor, "env state"] = t.tensor(obs, dtype=t.float32, device=device)
                probs: Float[Tensor, "env action"] = policy(obs_t)
                val:   Float[Tensor, "env"]        = rearrange(value_net(obs_t), 'env 1 -> env')

                m = t.distributions.Categorical(probs)
                action: Int[Tensor, "env"] = m.sample()

                mb_obs[step]     = obs_t
                mb_actions[step] = action
                mb_logp[step]    = m.log_prob(action)
                mb_values[step]  = val

                obs, reward, terminated, truncated, _ = envs.step(action.cpu().numpy())
                done = terminated | truncated

                mb_rewards[step] = t.tensor(reward, dtype=t.float32, device=device)
                mb_dones[step]   = t.tensor(done, dtype=t.float32, device=device)

                running_rewards += reward
                done_mask = done.astype(bool)
                if done_mask.any():
                    completed_rewards.extend(running_rewards[done_mask].tolist())
                    running_rewards[done_mask] = 0.0

            next_obs: Float[Tensor, "env state"] = t.tensor(obs, dtype=t.float32, device=device)
            next_val: Float[Tensor, "env"]       = rearrange(value_net(next_obs), 'env 1 -> env')

        advantages, returns = compute_gae(mb_rewards, mb_values, mb_dones, next_val, gamma, lmbda)

        flat_obs: Float[Tensor, "batch state"]  = rearrange(mb_obs, 'step env state -> (step env) state')
        flat_act: Int[Tensor, "batch"]          = rearrange(mb_actions, 'step env -> (step env)')
        flat_ret: Float[Tensor, "batch"]        = rearrange(returns, 'step env -> (step env)')
        flat_adv: Float[Tensor, "batch"]        = rearrange(advantages, 'step env -> (step env)')
        flat_adv = (flat_adv - flat_adv.mean()) / (flat_adv.std() + 1e-8)

        curr_probs: Float[Tensor, "batch action"] = policy(flat_obs)
        curr_m = t.distributions.Categorical(curr_probs)
        log_pi:   Float[Tensor, "batch"] = curr_m.log_prob(flat_act)
        curr_val: Float[Tensor, "batch"] = rearrange(value_net(flat_obs), 'batch 1 -> batch')

        p_loss:  Float[Tensor, ""] = -reduce(log_pi * flat_adv.detach(), 'batch -> ()', 'mean')
        v_loss:  Float[Tensor, ""] = 0.5 * F.mse_loss(curr_val, flat_ret.detach())
        entropy: Float[Tensor, ""] = curr_m.entropy().mean()
        loss:    Float[Tensor, ""] = p_loss + v_loss - 0.01 * entropy

        optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(policy.parameters(), 0.5)
        nn.utils.clip_grad_norm_(value_net.parameters(), 0.5)
        optimizer.step()

    envs.close()
    if save_path:
        os.makedirs(os.path.dirname(save_path), exist_ok=True)
        t.save(policy.state_dict(), save_path)
    return completed_rewards[:episodes]

Visualizing Performance: A2C + GAE

A2C updates the policy online, and its evaluation episode lasts 104 steps. Due to bootstrapping, the policy is sensitive under deterministic greedy evaluation, but it balances long enough to confirm that learning took place.

[Figure: A2C Agent]


Step 4: Adding Trust Regions — Proximal Policy Optimization (PPO)

To run multiple update epochs per rollout without catastrophic policy collapse, PPO uses importance sampling ratios $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ and clips the objective function:

$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t \left[ \min\left(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right) \right]$$
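The effect of the clip is easiest to see on single samples. The sketch below evaluates the per-sample term inside the expectation in plain Python; the function name and the example ratios are illustrative, and $\epsilon = 0.2$ is assumed as a typical value.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action (A > 0) whose probability already rose by 50%: capped at 1.2 * A
print(clipped_surrogate(ratio=1.5, advantage=2.0))
# Bad action (A < 0) whose probability already fell by 50%: pinned at 0.8 * A
print(clipped_surrogate(ratio=0.5, advantage=-2.0))
# Bad action whose probability rose: the unclipped, more pessimistic term is kept
print(clipped_surrogate(ratio=1.5, advantage=-2.0))
```

In the first two cases the objective no longer depends on the ratio, so its gradient is zero and the sample stops pushing the policy further in that direction. In the third case the policy moved the wrong way, so the full penalty is kept.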

Here is the clean, working PPO implementation:

def run_ppo(
    episodes: int = 1000,
    seed: int = 0,
    n_envs: int = N_ENVS,
    n_steps: int = N_STEPS,
    gamma: float = 0.99,
    lmbda: float = 0.95,
    lr: float = 3e-4,
    use_clip: bool = True,
    clip_ratio: float = 0.2,
    ent_coef: float = 0.01,
    ppo_epochs: int = 4,
    n_minibatches: int = 4,
    save_path: str = None
) -> list[float]:
    print("  Running PPO Standard...")
    envs = gym.vector.SyncVectorEnv([lambda: gym.make('CartPole-v1') for _ in range(n_envs)])
    obs, _ = envs.reset(seed=seed)

    state_size: int = envs.single_observation_space.shape[0]
    action_size: int = envs.single_action_space.n
    batch_size = n_envs * n_steps
    minibatch_size = batch_size // n_minibatches

    policy = PolicyNetwork(state_size, action_size).to(device)
    value_net = ValueNetwork(state_size).to(device)
    ortho_init(policy, value_net)

    optimizer_policy = optim.Adam(policy.parameters(), lr=lr, eps=1e-5)
    optimizer_value = optim.Adam(value_net.parameters(), lr=0.01, eps=1e-5)

    running_rewards = np.zeros(n_envs)
    completed_rewards: list[float] = []

    while len(completed_rewards) < episodes:
        mb_obs     = t.zeros(n_steps, n_envs, state_size, device=device)
        mb_actions = t.zeros(n_steps, n_envs, dtype=t.long, device=device)
        mb_logp    = t.zeros(n_steps, n_envs, device=device)
        mb_rewards = t.zeros(n_steps, n_envs, device=device)
        mb_dones   = t.zeros(n_steps, n_envs, device=device)
        mb_values  = t.zeros(n_steps, n_envs, device=device)

        with t.no_grad():
            for step in range(n_steps):
                obs_t: Float[Tensor, "env state"] = t.tensor(obs, dtype=t.float32, device=device)
                probs: Float[Tensor, "env action"] = policy(obs_t)
                val:   Float[Tensor, "env"]        = rearrange(value_net(obs_t), 'env 1 -> env')

                m = t.distributions.Categorical(probs)
                action: Int[Tensor, "env"] = m.sample()

                mb_obs[step]     = obs_t
                mb_actions[step] = action
                mb_logp[step]    = m.log_prob(action)
                mb_values[step]  = val

                obs, reward, terminated, truncated, _ = envs.step(action.cpu().numpy())
                done = terminated | truncated

                mb_rewards[step] = t.tensor(reward, dtype=t.float32, device=device)
                mb_dones[step]   = t.tensor(done, dtype=t.float32, device=device)

                running_rewards += reward
                done_mask = done.astype(bool)
                if done_mask.any():
                    completed_rewards.extend(running_rewards[done_mask].tolist())
                    running_rewards[done_mask] = 0.0

            next_obs: Float[Tensor, "env state"] = t.tensor(obs, dtype=t.float32, device=device)
            next_val: Float[Tensor, "env"]       = rearrange(value_net(next_obs), 'env 1 -> env')

        advantages, returns = compute_gae(mb_rewards, mb_values, mb_dones, next_val, gamma, lmbda)

        flat_obs:  Float[Tensor, "batch state"] = rearrange(mb_obs, 'step env state -> (step env) state')
        flat_act:  Int[Tensor, "batch"]         = rearrange(mb_actions, 'step env -> (step env)')
        flat_logp: Float[Tensor, "batch"]       = rearrange(mb_logp, 'step env -> (step env)')
        flat_ret:  Float[Tensor, "batch"]       = rearrange(returns, 'step env -> (step env)')
        flat_adv:  Float[Tensor, "batch"]       = rearrange(advantages, 'step env -> (step env)')
        flat_adv = (flat_adv - flat_adv.mean()) / (flat_adv.std() + 1e-8)

        for _ in range(ppo_epochs):
            indices = t.randperm(batch_size, device=device)
            for start in range(0, batch_size, minibatch_size):
                idx = indices[start : start + minibatch_size]

                mb_probs: Float[Tensor, "mini action"] = policy(flat_obs[idx])
                mb_m = t.distributions.Categorical(mb_probs)
                new_lp:  Float[Tensor, "mini"] = mb_m.log_prob(flat_act[idx])
                new_val: Float[Tensor, "mini"] = rearrange(value_net(flat_obs[idx]), 'mini 1 -> mini')

                ratio: Float[Tensor, "mini"] = t.exp(new_lp - flat_logp[idx])
                adv = flat_adv[idx]

                if use_clip:
                    surr1: Float[Tensor, "mini"] = ratio * adv
                    surr2: Float[Tensor, "mini"] = t.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * adv
                    p_loss: Float[Tensor, ""] = -reduce(t.min(surr1, surr2), 'mini -> ()', 'mean')
                else:
                    p_loss: Float[Tensor, ""] = -reduce(ratio * adv, 'mini -> ()', 'mean')

                v_loss:  Float[Tensor, ""] = 0.5 * F.mse_loss(new_val, flat_ret[idx])
                entropy: Float[Tensor, ""] = mb_m.entropy().mean()
                loss:    Float[Tensor, ""] = p_loss + v_loss - ent_coef * entropy

                optimizer_policy.zero_grad()
                optimizer_value.zero_grad()
                loss.backward()
                nn.utils.clip_grad_norm_(policy.parameters(), 0.5)
                nn.utils.clip_grad_norm_(value_net.parameters(), 0.5)
                optimizer_policy.step()
                optimizer_value.step()

    envs.close()
    if save_path:
        os.makedirs(os.path.dirname(save_path), exist_ok=True)
        t.save(policy.state_dict(), save_path)
    return completed_rewards[:episodes]

Visualizing Performance: PPO Standard (Fixed)

PPO Standard converges extremely fast and balances the pole perfectly for the maximum 500 steps. The motion is smooth and centered.

[Figure: PPO Standard]


Step 5: The Mystery of the Vectorized Flat-line (Failed Variations & RCA)

When I first ran A2C and PPO, they completely flat-lined near the random baseline (earning ~20 reward per episode). I was baffled: how could REINFORCE learn successfully, while these advanced algorithms failed entirely?

By digging into the metrics, I uncovered three distinct bugs in my implementation:

Bug 1: Combined Gradient Clipping (RCA 1)

I had initially combined the actor and critic parameters into a single list and applied gradient clipping globally:

# BUGGY CODE: Joint Gradient Clipping
all_params = list(policy.parameters()) + list(value_net.parameters())
optimizer = optim.Adam(all_params, lr=lr)
...
loss.backward()
nn.utils.clip_grad_norm_(all_params, 0.5)
optimizer.step()
  • Why it failed: In CartPole, undiscounted episode returns can reach 500, and even the discounted value targets ($\gamma = 0.99$) approach 100. This makes the MSE value loss (v_loss) massive, meaning the value network gradients are orders of magnitude larger than the policy gradients (e.g. value norm $\approx 10.0$ vs policy norm $\approx 0.09$). The joint clip_grad_norm_ rescales all gradients by max_norm / total_norm (here $\approx 0.5 / 10 = 0.05$), scaling down the policy gradients to near-zero ($\approx 0.004$) and freezing policy learning.
  • The Fix: Clip the policy and value network gradients independently:
nn.utils.clip_grad_norm_(policy.parameters(), 0.5)
nn.utils.clip_grad_norm_(value_net.parameters(), 0.5)
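The arithmetic behind this bug can be reproduced without any networks. The sketch below mimics the scaling rule of `clip_grad_norm_` (multiply every gradient by `max_norm / (total_norm + 1e-6)`, capped at 1) using the gradient norms quoted above; `clip_scale` is an illustrative helper, not a PyTorch function.

```python
import math


def clip_scale(total_norm: float, max_norm: float) -> float:
    """Factor applied to every gradient when clipping by global norm."""
    return min(1.0, max_norm / (total_norm + 1e-6))


value_norm, policy_norm, max_norm = 10.0, 0.09, 0.5  # norms quoted in the text

# Joint clipping: one global norm over both networks
joint_norm = math.sqrt(value_norm ** 2 + policy_norm ** 2)
policy_after_joint = policy_norm * clip_scale(joint_norm, max_norm)

# Independent clipping: each network is compared against max_norm on its own
policy_after_separate = policy_norm * clip_scale(policy_norm, max_norm)
value_after_separate = value_norm * clip_scale(value_norm, max_norm)

print(f"joint:    policy grad norm = {policy_after_joint:.4f}")
print(f"separate: policy grad norm = {policy_after_separate:.4f}")
print(f"separate: value grad norm  = {value_after_separate:.4f}")
```

With independent clipping the policy gradient is already below the threshold and passes through untouched, while the value gradient alone absorbs the clip.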

Bug 2: Update Frequency Deficit (RCA 2)

I evaluated the algorithms by counting completed episodes (e.g. up to 1,000 completed episodes). During early training, CartPole episodes terminate in $\approx 20$ steps.

  • Why it failed: A rollout length of n_steps=128 across 16 parallel environments collects $128 \times 16 = 2048$ environment transitions per rollout. If episodes last only 20 steps, a single rollout contains over 100 completed episodes. Since PPO only runs one update phase per rollout (ppo_epochs × n_minibatches = 16 minibatch gradient steps, all on the same data), it was updating on fresh data once per 100 completed episodes, whereas REINFORCE was updating up to 100 times. When PPO reached 1,000 completed episodes, it had only gone through $\approx 10$ rollout-and-update cycles!
  • The Fix: Decrease n_steps to 32. This increases the update frequency by a factor of 4, so the policy improves earlier and episodes become longer.
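The update budget can be estimated with a few lines of arithmetic. This sketch assumes the pessimistic case where episodes stay at 20 steps for the whole run (no learning yet), 16 environments, and the `ppo_epochs=4`, `n_minibatches=4` defaults from the PPO listing; `update_budget` is an illustrative helper.

```python
def update_budget(n_steps, n_envs=16, avg_episode_len=20, episodes=1000,
                  ppo_epochs=4, n_minibatches=4):
    """Rough count of rollouts and gradient steps before `episodes` episodes finish."""
    transitions_per_rollout = n_steps * n_envs
    episodes_per_rollout = transitions_per_rollout / avg_episode_len
    rollouts = episodes / episodes_per_rollout
    gradient_steps = rollouts * ppo_epochs * n_minibatches
    return transitions_per_rollout, episodes_per_rollout, rollouts, gradient_steps


for n_steps in (128, 32):
    transitions, eps_per_rollout, rollouts, grad_steps = update_budget(n_steps)
    print(f"n_steps={n_steps}: {transitions} transitions/rollout, "
          f"{eps_per_rollout:.1f} episodes/rollout, "
          f"{rollouts:.1f} rollouts, {grad_steps:.0f} minibatch gradient steps")
```

In practice the gap closes as soon as the policy improves, because longer episodes mean fewer completed episodes per rollout and therefore more rollouts inside the same 1,000-episode budget.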

Bug 3: Critic Network Learning Rate Bottleneck (RCA 3)

I initially used the standard deep RL learning rate of 3e-4 for both policy and value networks.

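  • Why it failed: Fitting a value function whose targets grow far beyond the network's initial outputs (discounted returns approach 100 at $\gamma = 0.99$, while undiscounted episode returns reach $500$) requires a much faster learning rate than 3e-4. Under the low learning rate, the critic predicted values near 0, making the advantage estimates highly inaccurate.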
  • The Fix: Set the critic learning rate to 0.01 (and policy learning rate to 3e-4 or 0.01 depending on the algorithm) to allow the value function to fit targets quickly.
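For a sense of the scale the critic has to reach, the sketch below computes the largest discounted return a CartPole-v1 state can have, assuming +1 reward per step, the 500-step episode cap, and $\gamma = 0.99$ as in the listings. The helper name is illustrative.

```python
def max_discounted_return(gamma: float, horizon: int, reward: float = 1.0) -> float:
    """Discounted return of a full-length episode: reward * (1 - gamma^horizon) / (1 - gamma)."""
    return reward * (1.0 - gamma ** horizon) / (1.0 - gamma)


# CartPole-v1: +1 reward per step, episodes capped at 500 steps
print(max_discounted_return(gamma=0.99, horizon=500))  # just under 100
```

The undiscounted episode return is 500, but the targets the critic regresses on are discounted, so they top out just below $1 / (1 - \gamma) = 100$. That is still roughly two orders of magnitude above a critic that predicts values near 0.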

The Buggy PPO Implementation

Here is the exact code for the "Buggy PPO" variation containing all three traps:

def run_ppo_buggy(
    episodes: int = 1000,
    seed: int = 0,
    n_envs: int = N_ENVS,
    n_steps: int = 128,  # Bug 2: Update frequency deficit
    gamma: float = 0.99,
    lmbda: float = 0.95,
    lr: float = 3e-4,     # Policy LR
    use_clip: bool = True,
    clip_ratio: float = 0.2,
    ent_coef: float = 0.01,
    ppo_epochs: int = 4,
    n_minibatches: int = 4,
    save_path: str = None
) -> list[float]:
    print("  Running PPO Buggy (combined clipping, lr_value=3e-4, n_steps=128)...")
    envs = gym.vector.SyncVectorEnv([lambda: gym.make('CartPole-v1') for _ in range(n_envs)])
    obs, _ = envs.reset(seed=seed)

    state_size: int = envs.single_observation_space.shape[0]
    action_size: int = envs.single_action_space.n
    batch_size = n_envs * n_steps
    minibatch_size = batch_size // n_minibatches

    policy = PolicyNetwork(state_size, action_size).to(device)
    value_net = ValueNetwork(state_size).to(device)
    ortho_init(policy, value_net)

    # Bug 3: Combined optimizer with same low learning rate 3e-4 for critic
    all_params = list(policy.parameters()) + list(value_net.parameters())
    optimizer = optim.Adam(all_params, lr=lr, eps=1e-5)

    running_rewards = np.zeros(n_envs)
    completed_rewards: list[float] = []

    while len(completed_rewards) < episodes:
        mb_obs     = t.zeros(n_steps, n_envs, state_size, device=device)
        mb_actions = t.zeros(n_steps, n_envs, dtype=t.long, device=device)
        mb_logp    = t.zeros(n_steps, n_envs, device=device)
        mb_rewards = t.zeros(n_steps, n_envs, device=device)
        mb_dones   = t.zeros(n_steps, n_envs, device=device)
        mb_values  = t.zeros(n_steps, n_envs, device=device)

        with t.no_grad():
            for step in range(n_steps):
                obs_t: Float[Tensor, "env state"] = t.tensor(obs, dtype=t.float32, device=device)
                probs: Float[Tensor, "env action"] = policy(obs_t)
                val:   Float[Tensor, "env"]        = rearrange(value_net(obs_t), 'env 1 -> env')

                m = t.distributions.Categorical(probs)
                action: Int[Tensor, "env"] = m.sample()

                mb_obs[step]     = obs_t
                mb_actions[step] = action
                mb_logp[step]    = m.log_prob(action)
                mb_values[step]  = val

                obs, reward, terminated, truncated, _ = envs.step(action.cpu().numpy())
                done = terminated | truncated

                mb_rewards[step] = t.tensor(reward, dtype=t.float32, device=device)
                mb_dones[step]   = t.tensor(done, dtype=t.float32, device=device)

                running_rewards += reward
                done_mask = done.astype(bool)
                if done_mask.any():
                    completed_rewards.extend(running_rewards[done_mask].tolist())
                    running_rewards[done_mask] = 0.0

            next_obs: Float[Tensor, "env state"] = t.tensor(obs, dtype=t.float32, device=device)
            next_val: Float[Tensor, "env"]       = rearrange(value_net(next_obs), 'env 1 -> env')

        advantages, returns = compute_gae(mb_rewards, mb_values, mb_dones, next_val, gamma, lmbda)

        flat_obs:  Float[Tensor, "batch state"] = rearrange(mb_obs, 'step env state -> (step env) state')
        flat_act:  Int[Tensor, "batch"]         = rearrange(mb_actions, 'step env -> (step env)')
        flat_logp: Float[Tensor, "batch"]       = rearrange(mb_logp, 'step env -> (step env)')
        flat_ret:  Float[Tensor, "batch"]       = rearrange(returns, 'step env -> (step env)')
        flat_adv:  Float[Tensor, "batch"]       = rearrange(advantages, 'step env -> (step env)')
        flat_adv = (flat_adv - flat_adv.mean()) / (flat_adv.std() + 1e-8)

        for _ in range(ppo_epochs):
            indices = t.randperm(batch_size, device=device)
            for start in range(0, batch_size, minibatch_size):
                idx = indices[start : start + minibatch_size]

                mb_probs: Float[Tensor, "mini action"] = policy(flat_obs[idx])
                mb_m = t.distributions.Categorical(mb_probs)
                new_lp:  Float[Tensor, "mini"] = mb_m.log_prob(flat_act[idx])
                new_val: Float[Tensor, "mini"] = rearrange(value_net(flat_obs[idx]), 'mini 1 -> mini')

                ratio: Float[Tensor, "mini"] = t.exp(new_lp - flat_logp[idx])
                adv = flat_adv[idx]

                if use_clip:
                    surr1: Float[Tensor, "mini"] = ratio * adv
                    surr2: Float[Tensor, "mini"] = t.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * adv
                    p_loss: Float[Tensor, ""] = -reduce(t.min(surr1, surr2), 'mini -> ()', 'mean')
                else:
                    p_loss: Float[Tensor, ""] = -reduce(ratio * adv, 'mini -> ()', 'mean')

                v_loss:  Float[Tensor, ""] = 0.5 * F.mse_loss(new_val, flat_ret[idx])
                entropy: Float[Tensor, ""] = mb_m.entropy().mean()
                loss:    Float[Tensor, ""] = p_loss + v_loss - ent_coef * entropy

                optimizer.zero_grad()
                loss.backward()
                # Bug 1: Combined gradient clipping
                nn.utils.clip_grad_norm_(all_params, 0.5)
                optimizer.step()

    envs.close()
    if save_path:
        os.makedirs(os.path.dirname(save_path), exist_ok=True)
        t.save(policy.state_dict(), save_path)
    return completed_rewards[:episodes]

Visualizing Performance: PPO Buggy (Failed)

As expected, the buggy version fails to learn, resulting in a policy that collapses and tips the pole within 64 steps in evaluation.

[Figure: PPO Buggy]


Step 6: Empirical Results & Final Comparisons

After addressing the bugs (applying independent gradient clipping, raising the value network learning rate to 0.01, and lowering rollout steps to 32), I reran the experiment over 5 independent random seeds per algorithm to evaluate training stability and capture the learning variance.

Here is the learning curve showing the moving average reward (solid line) and the standard deviation range (shaded region) across all five variations over 1,000 episodes:

[Figure: Algorithm Performance Comparison]
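For readers who want to build a similar plot, here is a minimal sketch of how per-seed reward curves could be aggregated into a mean line and a standard-deviation band. It is not the plotting code used for the figure: the trailing-window smoothing, the window size, the use of population standard deviation, and both function names are assumptions, and the numbers are toy data rather than results.

```python
from statistics import mean, pstdev


def moving_average(xs: list[float], window: int) -> list[float]:
    """Trailing moving average; early entries average over what is available."""
    out = []
    for i in range(len(xs)):
        chunk = xs[max(0, i - window + 1): i + 1]
        out.append(mean(chunk))
    return out


def aggregate_seeds(curves: list[list[float]], window: int):
    """Smooth each seed's curve, then take mean and std across seeds per episode."""
    smoothed = [moving_average(c, window) for c in curves]
    n = min(len(c) for c in smoothed)
    centre = [mean([c[i] for c in smoothed]) for i in range(n)]
    spread = [pstdev([c[i] for c in smoothed]) for i in range(n)]
    return centre, spread


# Toy data: three "seeds", four episodes each (not real results)
toy = [[10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 41.0], [8.0, 22.0, 27.0, 39.0]]
centre, spread = aggregate_seeds(toy, window=2)
print(centre)
print(spread)
```

The solid line would then be `centre`, and the shaded region would span `centre - spread` to `centre + spread` at each episode index.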

Summary of Results:

  • PPO Standard (Fixed): Extremely stable and consistent. It converges tightly to the maximum reward of 500.0 within ~530 episodes with very little variance between seeds.
  • A2C + GAE (Fixed): Also converges quickly to 500.0 reward within ~480 episodes, showing slightly wider variance bounds early on but stabilizing quickly.
  • REINFORCE (With Baseline): Learns steadily, but displays moderate variance between runs and takes longer to stabilize due to the inherent noise of Monte Carlo return rollouts.
  • REINFORCE (No Baseline): Exhibits the highest variance across seeds and slowest overall convergence, highlighting the necessity of baselines.
  • PPO Buggy (Failed): Flat-lines completely at the random baseline reward of ~20.0 with virtually zero variance, confirming it fails to learn under all seeds.

Conclusion

This investigation highlighted that in deep RL, implementation details are just as important as the mathematical formulation. A single combined gradient clipping line or a mismatched rollout size can completely freeze learning. Separating policy/value parameters, tuning value function optimizers to match prediction targets, and tracking update frequencies relative to completed episodes are critical checks when scaling from basic Monte Carlo baselines to vectorized step-based policy gradient methods.