Deep RL Diary: Implementing Policy Gradients on CartPole-v1 from Scratch
In reinforcement learning, policy gradient methods directly parameterize the policy and optimize it to maximize the expected cumulative reward. While mathematically elegant, they are notoriously sensitive to hyperparameters, prone to high-variance gradient estimates, and vulnerable to training instability.
This post documents my investigative journey of implementing these algorithms on CartPole-v1 in PyTorch. I started from the basic single-episode REINFORCE algorithm, scaled up to vectorized environments with A2C + GAE, implemented Proximal Policy Optimization (PPO), and encountered a series of subtle implementation traps that completely broke learning in the vectorized versions before resolving them.
Here is the step-by-step log of what worked, what failed, and why.
Step 1: The Basics — REINFORCE (No Baseline)
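The foundation of policy optimization is the Policy Gradient Theorem. The gradient of the expected return is given by:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$

where $G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k$ is the Monte Carlo return from step $t$.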
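As a small worked example of the Monte Carlo return, here is a sketch in plain Python. The helper name `discounted_returns` is mine, not part of the training code in this post. It applies the backward recursion $G_t = r_t + \gamma G_{t+1}$ to a three-step episode:

```python
def discounted_returns(rewards: list[float], gamma: float) -> list[float]:
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns: list[float] = []
    g = 0.0
    # Walk the episode backwards so each return reuses the one after it.
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


if __name__ == "__main__":
    # Three steps of reward 1 with gamma = 0.5.
    print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Earlier steps receive larger returns because they are credited with everything that follows them.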
In my first implementation, I used a vectorized environment to parallelize rollouts, but a trajectory only contributes to an update once its full episode has been collected. Here is the code for basic REINFORCE without a baseline:
```python
def run_reinforce_no_baseline(
    episodes: int = 1000,
    seed: int = 0,
    n_envs: int = N_ENVS,
    gamma: float = 0.99,
    lr: float = 0.01,
    save_path: str = None
) -> list[float]:
    print(" Running REINFORCE (No Baseline)...")
    envs = gym.vector.SyncVectorEnv([lambda: gym.make('CartPole-v1') for _ in range(n_envs)])
    obs, _ = envs.reset(seed=seed)
    state_size: int = envs.single_observation_space.shape[0]
    action_size: int = envs.single_action_space.n
    policy = PolicyNetwork(state_size, action_size).to(device)
    optimizer = optim.Adam(policy.parameters(), lr=lr)
    env_states: list[list[np.ndarray]] = [[] for _ in range(n_envs)]
    env_actions: list[list[int]] = [[] for _ in range(n_envs)]
    env_rewards: list[list[float]] = [[] for _ in range(n_envs)]
    completed: list[float] = []
    while len(completed) < episodes:
        with t.no_grad():
            obs_t: Float[Tensor, "env state"] = t.tensor(obs, dtype=t.float32, device=device)
            probs: Float[Tensor, "env action"] = policy(obs_t)
            m = t.distributions.Categorical(probs)
            action: Int[Tensor, "env"] = m.sample()
        for i in range(n_envs):
            env_states[i].append(obs[i].copy())
            env_actions[i].append(action[i].item())
        obs, reward, terminated, truncated, _ = envs.step(action.cpu().numpy())
        done = terminated | truncated
        for i in range(n_envs):
            env_rewards[i].append(reward[i])
        done_indices = np.where(done)[0]
        if len(done_indices) > 0:
            all_losses: list[Float[Tensor, ""]] = []
            for i in done_indices:
                ep_rewards = env_rewards[i]
                completed.append(sum(ep_rewards))
                returns: list[float] = []
                G = 0.0
                for r in reversed(ep_rewards):
                    G = r + gamma * G
                    returns.append(G)
                returns.reverse()
                G_t: Float[Tensor, "time"] = t.tensor(returns, dtype=t.float32, device=device)
                # Normalize returns to scale gradients
                G_t = (G_t - G_t.mean()) / (G_t.std() + 1e-8)
                ep_obs: Float[Tensor, "time state"] = t.tensor(
                    np.array(env_states[i]), dtype=t.float32, device=device
                )
                ep_act: Int[Tensor, "time"] = t.tensor(env_actions[i], dtype=t.long, device=device)
                ep_probs: Float[Tensor, "time action"] = policy(ep_obs)
                ep_m = t.distributions.Categorical(ep_probs)
                log_pi: Float[Tensor, "time"] = ep_m.log_prob(ep_act)
                all_losses.append(-reduce(G_t * log_pi, 'time -> ()', 'sum'))
                env_states[i] = []
                env_actions[i] = []
                env_rewards[i] = []
            total_loss: Float[Tensor, ""] = t.stack(all_losses).mean()
            optimizer.zero_grad()
            total_loss.backward()
            optimizer.step()
    envs.close()
    if save_path:
        os.makedirs(os.path.dirname(save_path), exist_ok=True)
        t.save(policy.state_dict(), save_path)
    return completed[:episodes]
```
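One detail of this listing is worth isolating: the returns are standardized per episode before they weight the log-probabilities. The sketch below reproduces that step in plain Python. The function name is illustrative, and it uses the sample standard deviation (n - 1 in the denominator), which as far as I know matches the default of `Tensor.std()`.

```python
import statistics


def normalize_returns(returns: list[float], eps: float = 1e-8) -> list[float]:
    """Shift returns to zero mean and scale them to (roughly) unit sample std."""
    mean = statistics.fmean(returns)
    std = statistics.stdev(returns)  # sample std; needs at least two values
    return [(g - mean) / (std + eps) for g in returns]


if __name__ == "__main__":
    # Returns of a three-step episode with reward 1 per step and gamma = 0.5.
    print(normalize_returns([1.75, 1.5, 1.0]))
```

Because the mean is subtracted, late steps of an episode end up with negative weights even though every raw reward is positive. In that sense the "no baseline" variant already uses a crude constant baseline per episode.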
Visualizing Performance: Random vs. REINFORCE (No Baseline)
Before training, a Random Agent wiggles aimlessly and fails in under 15 steps. Once trained with REINFORCE (No Baseline), the agent manages to balance the pole for the full duration of 500 steps, though with visible jittering.
| Random Agent (Failed) | REINFORCE No Baseline (Balanced) |
|---|---|
| (animation: random agent) | (animation: trained REINFORCE agent) |
Step 2: Variance Reduction — REINFORCE (With Baseline)
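Because $G_t$ is computed from a single full rollout trajectory, basic REINFORCE is highly sensitive to the randomness of each sampled trajectory. To reduce variance without introducing bias, we subtract a state-dependent baseline $V(s_t)$ from the return. The advantage estimator becomes $A_t = G_t - V(s_t)$.
Mathematically, subtracting any baseline $b(s_t)$ that does not depend on the action keeps the gradient unbiased because the expected value of the baseline gradient is zero:

$$\mathbb{E}_{a_t \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\right] = b(s_t) \sum_{a} \nabla_\theta \pi_\theta(a \mid s_t) = b(s_t)\, \nabla_\theta 1 = 0$$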
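The zero-expectation claim can be checked numerically. The sketch below assumes a softmax policy over two actions, for which the score with respect to logit $j$ is $\mathbb{1}[a = j] - \pi(j)$. The helper names and the numbers are illustrative, not taken from the training code.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_baseline_term(logits: list[float], baseline: float) -> list[float]:
    """E_a[ b * d log pi(a) / d logit_j ] for each logit j of a softmax policy."""
    probs = softmax(logits)
    n = len(logits)
    expectation = [0.0] * n
    for a in range(n):
        for j in range(n):
            # d log pi(a) / d logit_j = 1[a == j] - pi(j) for a softmax policy
            score = (1.0 if a == j else 0.0) - probs[j]
            expectation[j] += probs[a] * baseline * score
    return expectation


if __name__ == "__main__":
    # Every entry is zero up to floating-point rounding, whatever the baseline.
    print(expected_baseline_term([0.3, -1.2], baseline=42.0))
```

The baseline changes the variance of individual gradient samples, but it contributes nothing to their expectation.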
In code, I trained a critic network alongside the policy, using mean-squared error against the Monte Carlo returns:
```python
def run_reinforce_with_baseline(
    episodes: int = 1000,
    seed: int = 0,
    n_envs: int = N_ENVS,
    gamma: float = 0.99,
    lr_policy: float = 0.01,
    lr_value: float = 0.01,
    save_path: str = None
) -> list[float]:
    print(" Running REINFORCE (With Baseline)...")
    envs = gym.vector.SyncVectorEnv([lambda: gym.make('CartPole-v1') for _ in range(n_envs)])
    obs, _ = envs.reset(seed=seed)
    state_size: int = envs.single_observation_space.shape[0]
    action_size: int = envs.single_action_space.n
    policy = PolicyNetwork(state_size, action_size).to(device)
    value_net = ValueNetwork(state_size).to(device)
    # Separate parameter groups so lr_value is actually used
    # (both defaults are 0.01, so default behavior is unchanged).
    optimizer = optim.Adam([
        {'params': policy.parameters(), 'lr': lr_policy},
        {'params': value_net.parameters(), 'lr': lr_value},
    ])
    env_states: list[list[np.ndarray]] = [[] for _ in range(n_envs)]
    env_actions: list[list[int]] = [[] for _ in range(n_envs)]
    env_rewards: list[list[float]] = [[] for _ in range(n_envs)]
    completed: list[float] = []
    while len(completed) < episodes:
        with t.no_grad():
            obs_t: Float[Tensor, "env state"] = t.tensor(obs, dtype=t.float32, device=device)
            probs: Float[Tensor, "env action"] = policy(obs_t)
            m = t.distributions.Categorical(probs)
            action: Int[Tensor, "env"] = m.sample()
        for i in range(n_envs):
            env_states[i].append(obs[i].copy())
            env_actions[i].append(action[i].item())
        obs, reward, terminated, truncated, _ = envs.step(action.cpu().numpy())
        done = terminated | truncated
        for i in range(n_envs):
            env_rewards[i].append(reward[i])
        done_indices = np.where(done)[0]
        if len(done_indices) > 0:
            all_losses: list[Float[Tensor, ""]] = []
            for i in done_indices:
                ep_rewards = env_rewards[i]
                completed.append(sum(ep_rewards))
                returns_list: list[float] = []
                G = 0.0
                for r in reversed(ep_rewards):
                    G = r + gamma * G
                    returns_list.append(G)
                returns_list.reverse()
                G_t: Float[Tensor, "time"] = t.tensor(returns_list, dtype=t.float32, device=device)
                ep_obs: Float[Tensor, "time state"] = t.tensor(
                    np.array(env_states[i]), dtype=t.float32, device=device
                )
                ep_act: Int[Tensor, "time"] = t.tensor(env_actions[i], dtype=t.long, device=device)
                ep_probs: Float[Tensor, "time action"] = policy(ep_obs)
                ep_m = t.distributions.Categorical(ep_probs)
                log_pi: Float[Tensor, "time"] = ep_m.log_prob(ep_act)
                V_t: Float[Tensor, "time"] = rearrange(value_net(ep_obs), 'time 1 -> time')
                # Detach the baseline so the policy loss does not train the critic
                A_t: Float[Tensor, "time"] = G_t - V_t.detach()
                policy_loss: Float[Tensor, ""] = -reduce(A_t * log_pi, 'time -> ()', 'sum')
                value_loss: Float[Tensor, ""] = reduce((V_t - G_t) ** 2, 'time -> ()', 'mean')
                all_losses.append(policy_loss + value_loss)
                env_states[i] = []
                env_actions[i] = []
                env_rewards[i] = []
            total_loss: Float[Tensor, ""] = t.stack(all_losses).mean()
            optimizer.zero_grad()
            total_loss.backward()
            optimizer.step()
    envs.close()
    if save_path:
        os.makedirs(os.path.dirname(save_path), exist_ok=True)
        t.save(policy.state_dict(), save_path)
    return completed[:episodes]
```
Visualizing Performance: REINFORCE (With Baseline)
With the baseline reducing gradient variance, training stabilizes, and the resulting policy balances the pole for 362 steps in evaluation.

Step 3: Scaling Up — Vectorized A2C + GAE
Instead of waiting for an entire episode to finish (Monte Carlo), online Actor-Critic methods update the policy using temporal difference (TD) learning. To balance bias and variance, we can compute Generalized Advantage Estimation (GAE) over a fixed rollout window of $T$ steps (`n_steps`) in each of $N$ parallel environments (`n_envs`).
GAE introduces a mixing parameter $\lambda$:

$$\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma\lambda)^l\, \delta_{t+l}$$

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD error.
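To make the recursion concrete, here is a single-environment sketch in plain Python. It mirrors the backward loop of the vectorized `compute_gae` in this post, using lists instead of tensors. The function name `gae_single_env` and the toy numbers are mine.

```python
def gae_single_env(
    rewards: list[float],
    values: list[float],
    dones: list[float],
    next_value: float,
    gamma: float,
    lam: float,
) -> list[float]:
    """Backward GAE recursion for one environment."""
    n_steps = len(rewards)
    advantages = [0.0] * n_steps
    gae = 0.0
    for step in reversed(range(n_steps)):
        next_val = next_value if step == n_steps - 1 else values[step + 1]
        not_done = 1.0 - dones[step]  # stop bootstrapping at episode ends
        delta = rewards[step] + gamma * next_val * not_done - values[step]
        gae = delta + gamma * lam * not_done * gae
        advantages[step] = gae
    return advantages


if __name__ == "__main__":
    rewards, values, dones = [1.0, 1.0, 1.0], [0.5, 0.5, 0.5], [0.0, 0.0, 1.0]
    # lam = 1 recovers Monte Carlo return minus value: [2.5, 1.5, 0.5]
    print(gae_single_env(rewards, values, dones, 0.0, gamma=1.0, lam=1.0))
    # lam = 0 recovers the one-step TD errors: [1.0, 1.0, 0.5]
    print(gae_single_env(rewards, values, dones, 0.0, gamma=1.0, lam=0.0))
```

The two extremes show what $\lambda$ trades off: $\lambda = 1$ trusts the sampled rewards (low bias, high variance), while $\lambda = 0$ trusts the critic (low variance, more bias).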
Here is the vectorized GAE implementation, followed by the main A2C training loop:
```python
def compute_gae(
    rewards: Float[Tensor, "step env"],
    values: Float[Tensor, "step env"],
    dones: Float[Tensor, "step env"],
    next_value: Float[Tensor, "env"],
    gamma: float,
    lmbda: float,
) -> tuple[Float[Tensor, "step env"], Float[Tensor, "step env"]]:
    n_steps = rewards.shape[0]
    advantages: Float[Tensor, "step env"] = t.zeros_like(rewards)
    gae: Float[Tensor, "env"] = t.zeros_like(next_value)
    for step in reversed(range(n_steps)):
        next_val = next_value if step == n_steps - 1 else values[step + 1]
        not_done: Float[Tensor, "env"] = 1.0 - dones[step]
        delta: Float[Tensor, "env"] = rewards[step] + gamma * next_val * not_done - values[step]
        gae = delta + gamma * lmbda * not_done * gae
        advantages[step] = gae
    returns: Float[Tensor, "step env"] = advantages + values
    return advantages, returns
```
```python
def run_a2c_gae(
    episodes: int = 1000,
    seed: int = 0,
    n_envs: int = N_ENVS,
    n_steps: int = N_STEPS,
    gamma: float = 0.99,
    lmbda: float = 0.95,
    lr: float = 0.01,
    save_path: str = None
) -> list[float]:
    print(" Running A2C + GAE (vectorized)...")
    envs = gym.vector.SyncVectorEnv([lambda: gym.make('CartPole-v1') for _ in range(n_envs)])
    obs, _ = envs.reset(seed=seed)
    state_size: int = envs.single_observation_space.shape[0]
    action_size: int = envs.single_action_space.n
    policy = PolicyNetwork(state_size, action_size).to(device)
    value_net = ValueNetwork(state_size).to(device)
    all_params = list(policy.parameters()) + list(value_net.parameters())
    optimizer = optim.Adam(all_params, lr=lr)
    running_rewards = np.zeros(n_envs)
    completed_rewards: list[float] = []
    while len(completed_rewards) < episodes:
        mb_obs = t.zeros(n_steps, n_envs, state_size, device=device)
        mb_actions = t.zeros(n_steps, n_envs, dtype=t.long, device=device)
        mb_logp = t.zeros(n_steps, n_envs, device=device)
        mb_rewards = t.zeros(n_steps, n_envs, device=device)
        mb_dones = t.zeros(n_steps, n_envs, device=device)
        mb_values = t.zeros(n_steps, n_envs, device=device)
        with t.no_grad():
            for step in range(n_steps):
                obs_t: Float[Tensor, "env state"] = t.tensor(obs, dtype=t.float32, device=device)
                probs: Float[Tensor, "env action"] = policy(obs_t)
                val: Float[Tensor, "env"] = rearrange(value_net(obs_t), 'env 1 -> env')
                m = t.distributions.Categorical(probs)
                action: Int[Tensor, "env"] = m.sample()
                mb_obs[step] = obs_t
                mb_actions[step] = action
                mb_logp[step] = m.log_prob(action)
                mb_values[step] = val
                obs, reward, terminated, truncated, _ = envs.step(action.cpu().numpy())
                done = terminated | truncated
                mb_rewards[step] = t.tensor(reward, dtype=t.float32, device=device)
                mb_dones[step] = t.tensor(done, dtype=t.float32, device=device)
                running_rewards += reward
                done_mask = done.astype(bool)
                if done_mask.any():
                    completed_rewards.extend(running_rewards[done_mask].tolist())
                    running_rewards[done_mask] = 0.0
            next_obs: Float[Tensor, "env state"] = t.tensor(obs, dtype=t.float32, device=device)
            next_val: Float[Tensor, "env"] = rearrange(value_net(next_obs), 'env 1 -> env')
        advantages, returns = compute_gae(mb_rewards, mb_values, mb_dones, next_val, gamma, lmbda)
        flat_obs: Float[Tensor, "batch state"] = rearrange(mb_obs, 'step env state -> (step env) state')
        flat_act: Int[Tensor, "batch"] = rearrange(mb_actions, 'step env -> (step env)')
        flat_ret: Float[Tensor, "batch"] = rearrange(returns, 'step env -> (step env)')
        flat_adv: Float[Tensor, "batch"] = rearrange(advantages, 'step env -> (step env)')
        flat_adv = (flat_adv - flat_adv.mean()) / (flat_adv.std() + 1e-8)
        curr_probs: Float[Tensor, "batch action"] = policy(flat_obs)
        curr_m = t.distributions.Categorical(curr_probs)
        log_pi: Float[Tensor, "batch"] = curr_m.log_prob(flat_act)
        curr_val: Float[Tensor, "batch"] = rearrange(value_net(flat_obs), 'batch 1 -> batch')
        p_loss: Float[Tensor, ""] = -reduce(log_pi * flat_adv.detach(), 'batch -> ()', 'mean')
        v_loss: Float[Tensor, ""] = 0.5 * F.mse_loss(curr_val, flat_ret.detach())
        entropy: Float[Tensor, ""] = curr_m.entropy().mean()
        loss: Float[Tensor, ""] = p_loss + v_loss - 0.01 * entropy
        optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(policy.parameters(), 0.5)
        nn.utils.clip_grad_norm_(value_net.parameters(), 0.5)
        optimizer.step()
    envs.close()
    if save_path:
        os.makedirs(os.path.dirname(save_path), exist_ok=True)
        t.save(policy.state_dict(), save_path)
    return completed_rewards[:episodes]
```
Visualizing Performance: A2C + GAE
A2C updates the policy online, and its saved policy lasted 104 steps in evaluation. The policy appears brittle under deterministic greedy evaluation, which I attribute to bootstrapping, but it balances long enough to verify learning.

Step 4: Adding Trust Regions — Proximal Policy Optimization (PPO)
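To run multiple update epochs per rollout without catastrophic policy collapse, PPO uses importance sampling ratios and clips the objective function:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$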
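The effect of the clip is easiest to see on single numbers. The sketch below evaluates the clipped surrogate for one sample in plain Python. `clipped_surrogate` is an illustrative helper, and the default `clip_ratio=0.2` follows the value used in the PPO code of this post.

```python
def clipped_surrogate(ratio: float, advantage: float, clip_ratio: float = 0.2) -> float:
    """Per-sample PPO objective: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped = min(max(ratio, 1.0 - clip_ratio), 1.0 + clip_ratio)
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    # Good action, ratio already above 1 + eps: the gain is capped.
    print(clipped_surrogate(1.5, 1.0))
    # Bad action made more likely: the penalty is not capped.
    print(clipped_surrogate(1.5, -1.0))
    # Bad action, ratio already below 1 - eps: no further reward for pushing it down.
    print(clipped_surrogate(0.5, -1.0))
```

Taking the minimum makes the objective pessimistic: improvements beyond the clip range earn nothing, while moves that make things worse are still fully penalized.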
Here is the clean, working PPO implementation:
```python
def run_ppo(
    episodes: int = 1000,
    seed: int = 0,
    n_envs: int = N_ENVS,
    n_steps: int = N_STEPS,
    gamma: float = 0.99,
    lmbda: float = 0.95,
    lr: float = 3e-4,
    use_clip: bool = True,
    clip_ratio: float = 0.2,
    ent_coef: float = 0.01,
    ppo_epochs: int = 4,
    n_minibatches: int = 4,
    save_path: str = None
) -> list[float]:
    print(" Running PPO Standard...")
    envs = gym.vector.SyncVectorEnv([lambda: gym.make('CartPole-v1') for _ in range(n_envs)])
    obs, _ = envs.reset(seed=seed)
    state_size: int = envs.single_observation_space.shape[0]
    action_size: int = envs.single_action_space.n
    batch_size = n_envs * n_steps
    minibatch_size = batch_size // n_minibatches
    policy = PolicyNetwork(state_size, action_size).to(device)
    value_net = ValueNetwork(state_size).to(device)
    ortho_init(policy, value_net)
    optimizer_policy = optim.Adam(policy.parameters(), lr=lr, eps=1e-5)
    optimizer_value = optim.Adam(value_net.parameters(), lr=0.01, eps=1e-5)
    running_rewards = np.zeros(n_envs)
    completed_rewards: list[float] = []
    while len(completed_rewards) < episodes:
        mb_obs = t.zeros(n_steps, n_envs, state_size, device=device)
        mb_actions = t.zeros(n_steps, n_envs, dtype=t.long, device=device)
        mb_logp = t.zeros(n_steps, n_envs, device=device)
        mb_rewards = t.zeros(n_steps, n_envs, device=device)
        mb_dones = t.zeros(n_steps, n_envs, device=device)
        mb_values = t.zeros(n_steps, n_envs, device=device)
        with t.no_grad():
            for step in range(n_steps):
                obs_t: Float[Tensor, "env state"] = t.tensor(obs, dtype=t.float32, device=device)
                probs: Float[Tensor, "env action"] = policy(obs_t)
                val: Float[Tensor, "env"] = rearrange(value_net(obs_t), 'env 1 -> env')
                m = t.distributions.Categorical(probs)
                action: Int[Tensor, "env"] = m.sample()
                mb_obs[step] = obs_t
                mb_actions[step] = action
                mb_logp[step] = m.log_prob(action)
                mb_values[step] = val
                obs, reward, terminated, truncated, _ = envs.step(action.cpu().numpy())
                done = terminated | truncated
                mb_rewards[step] = t.tensor(reward, dtype=t.float32, device=device)
                mb_dones[step] = t.tensor(done, dtype=t.float32, device=device)
                running_rewards += reward
                done_mask = done.astype(bool)
                if done_mask.any():
                    completed_rewards.extend(running_rewards[done_mask].tolist())
                    running_rewards[done_mask] = 0.0
            next_obs: Float[Tensor, "env state"] = t.tensor(obs, dtype=t.float32, device=device)
            next_val: Float[Tensor, "env"] = rearrange(value_net(next_obs), 'env 1 -> env')
        advantages, returns = compute_gae(mb_rewards, mb_values, mb_dones, next_val, gamma, lmbda)
        flat_obs: Float[Tensor, "batch state"] = rearrange(mb_obs, 'step env state -> (step env) state')
        flat_act: Int[Tensor, "batch"] = rearrange(mb_actions, 'step env -> (step env)')
        flat_logp: Float[Tensor, "batch"] = rearrange(mb_logp, 'step env -> (step env)')
        flat_ret: Float[Tensor, "batch"] = rearrange(returns, 'step env -> (step env)')
        flat_adv: Float[Tensor, "batch"] = rearrange(advantages, 'step env -> (step env)')
        flat_adv = (flat_adv - flat_adv.mean()) / (flat_adv.std() + 1e-8)
        for _ in range(ppo_epochs):
            indices = t.randperm(batch_size, device=device)
            for start in range(0, batch_size, minibatch_size):
                idx = indices[start : start + minibatch_size]
                mb_probs: Float[Tensor, "mini action"] = policy(flat_obs[idx])
                mb_m = t.distributions.Categorical(mb_probs)
                new_lp: Float[Tensor, "mini"] = mb_m.log_prob(flat_act[idx])
                new_val: Float[Tensor, "mini"] = rearrange(value_net(flat_obs[idx]), 'mini 1 -> mini')
                ratio: Float[Tensor, "mini"] = t.exp(new_lp - flat_logp[idx])
                adv = flat_adv[idx]
                if use_clip:
                    surr1: Float[Tensor, "mini"] = ratio * adv
                    surr2: Float[Tensor, "mini"] = t.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * adv
                    p_loss: Float[Tensor, ""] = -reduce(t.min(surr1, surr2), 'mini -> ()', 'mean')
                else:
                    p_loss: Float[Tensor, ""] = -reduce(ratio * adv, 'mini -> ()', 'mean')
                v_loss: Float[Tensor, ""] = 0.5 * F.mse_loss(new_val, flat_ret[idx])
                entropy: Float[Tensor, ""] = mb_m.entropy().mean()
                loss: Float[Tensor, ""] = p_loss + v_loss - ent_coef * entropy
                optimizer_policy.zero_grad()
                optimizer_value.zero_grad()
                loss.backward()
                nn.utils.clip_grad_norm_(policy.parameters(), 0.5)
                nn.utils.clip_grad_norm_(value_net.parameters(), 0.5)
                optimizer_policy.step()
                optimizer_value.step()
    envs.close()
    if save_path:
        os.makedirs(os.path.dirname(save_path), exist_ok=True)
        t.save(policy.state_dict(), save_path)
    return completed_rewards[:episodes]
```
Visualizing Performance: PPO Standard (Fixed)
PPO Standard converges extremely fast and balances the pole perfectly for the maximum 500 steps. The motion is smooth and centered.

Step 5: The Mystery of the Vectorized Flat-line (Failed Variations & RCA)
When I first ran A2C and PPO, they completely flat-lined near the random baseline (earning ~20 reward per episode). I was baffled: how could REINFORCE learn successfully, while these advanced algorithms failed entirely?
By digging into the metrics, I uncovered three distinct bugs in my implementation:
Bug 1: Combined Gradient Clipping (RCA 1)
I had initially combined the actor and critic parameters into a single list and applied gradient clipping globally:
```python
# BUGGY CODE: Joint Gradient Clipping
all_params = list(policy.parameters()) + list(value_net.parameters())
optimizer = optim.Adam(all_params, lr=lr)
...
loss.backward()
nn.utils.clip_grad_norm_(all_params, 0.5)
optimizer.step()
```
- Why it failed: In CartPole, episode returns can reach 500, and even the discounted targets used here approach $1/(1-\gamma) = 100$ for $\gamma = 0.99$. This makes the MSE value loss $\tfrac{1}{2}\big(V(s_t) - \hat{R}_t\big)^2$ massive early in training, so the value network gradients were orders of magnitude larger than the policy gradients. A joint `clip_grad_norm_` call rescales every gradient by the same factor, `max_norm / total_norm`. Because the total norm is dominated by the critic, the policy gradients were scaled down to near zero, freezing policy learning.
- The Fix: Clip the policy and value network gradients independently:
```python
nn.utils.clip_grad_norm_(policy.parameters(), 0.5)
nn.utils.clip_grad_norm_(value_net.parameters(), 0.5)
```
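The arithmetic behind this bug can be sketched without any tensors. The norms below are hypothetical values chosen for illustration, not measurements from my runs, and `clip_scale` is a simplified stand-in for the rescaling factor that norm clipping applies.

```python
import math


def clip_scale(norms: list[float], max_norm: float) -> float:
    """Factor by which norm clipping rescales every gradient passed to it."""
    # The norm of the concatenated gradients combines the per-network norms.
    total = math.sqrt(sum(n * n for n in norms))
    return min(1.0, max_norm / (total + 1e-6))


if __name__ == "__main__":
    policy_norm, value_norm = 0.1, 1000.0  # hypothetical gradient norms
    joint = clip_scale([policy_norm, value_norm], max_norm=0.5)
    separate = clip_scale([policy_norm], max_norm=0.5)
    print("policy norm after joint clipping:   ", policy_norm * joint)
    print("policy norm after separate clipping:", policy_norm * separate)
```

Under joint clipping the policy gradient shrinks by the same factor as the critic's, even though it was already well inside the clipping threshold on its own.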
Bug 2: Update Frequency Deficit (RCA 2)
I evaluated the algorithms by counting completed episodes (e.g. up to 1,000 completed episodes). During early training, CartPole episodes terminate in roughly 20 steps.
- Why it failed: A rollout length of `n_steps=128` across 16 parallel environments collects $128 \times 16 = 2048$ environment transitions per rollout. If episodes last only 20 steps, a single rollout contains over 100 completed episodes. Since PPO only runs its update phase once per rollout, it was performing 1 update phase per ~100 completed episodes, whereas REINFORCE could update up to 100 times over the same episodes. When PPO reached 1,000 completed episodes, it had only performed roughly 10 rollout updates!
- The Fix: Decreased `n_steps` to `32`. This increases update frequency by a factor of 4, so the agent improves earlier and episodes become longer.
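The update-count arithmetic can be written out as a short sketch. It makes the simplifying assumption that every episode lasts exactly 20 steps, which is only roughly true early in training, and the helper name is mine.

```python
import math


def rollouts_until(episodes: int, n_steps: int, n_envs: int, episode_len: int) -> int:
    """Rollouts (one update phase each) needed to see `episodes` finished episodes."""
    transitions_per_rollout = n_steps * n_envs
    episodes_per_rollout = transitions_per_rollout / episode_len
    return math.ceil(episodes / episodes_per_rollout)


if __name__ == "__main__":
    # Assumes fixed 20-step episodes and 16 environments.
    print(rollouts_until(1000, n_steps=128, n_envs=16, episode_len=20))  # 10
    print(rollouts_until(1000, n_steps=32, n_envs=16, episode_len=20))   # 40
```

As episodes get longer, each rollout contains fewer finished episodes, so the gap between episode count and update count narrows on its own once learning starts.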
Bug 3: Critic Network Learning Rate Bottleneck (RCA 3)
I initially used the standard deep RL learning rate of `3e-4` for both policy and value networks.
- Why it failed: Fitting a value function to targets this large (episode returns up to 500, or roughly 100 once discounted with $\gamma = 0.99$) needed a much higher learning rate than `3e-4`. Under the low learning rate, the critic kept predicting values near 0, making the advantage estimates highly inaccurate.
- The Fix: Set the critic learning rate to `0.01` (and the policy learning rate to `3e-4` or `0.01` depending on the algorithm) to allow the value function to fit its targets quickly.
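A toy calculation gives some intuition for why the learning rate matters when the targets are large. This is deliberately not the setup used in this post: it fits a single scalar to a constant target of 100 with plain gradient descent rather than training a network with Adam. The step count of 160 is an arbitrary illustrative budget.

```python
def fit_constant(target: float, lr: float, steps: int) -> float:
    """Gradient descent on 0.5 * (v - target)**2, starting from v = 0."""
    v = 0.0
    for _ in range(steps):
        grad = v - target  # derivative of the squared error with respect to v
        v -= lr * grad
    return v


if __name__ == "__main__":
    # Same target and update budget, two learning rates.
    print(fit_constant(100.0, lr=3e-4, steps=160))
    print(fit_constant(100.0, lr=0.01, steps=160))
```

With the small learning rate the estimate is still close to its starting value after the whole budget, while the larger one covers most of the distance. The absolute numbers do not transfer to the real critic, only the qualitative picture does.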
The Buggy PPO Implementation
Here is the exact code for the "Buggy PPO" variation containing all three traps:
```python
def run_ppo_buggy(
    episodes: int = 1000,
    seed: int = 0,
    n_envs: int = N_ENVS,
    n_steps: int = 128,  # Bug 2: update frequency deficit (rollouts too long)
    gamma: float = 0.99,
    lmbda: float = 0.95,
    lr: float = 3e-4,  # Policy LR
    use_clip: bool = True,
    clip_ratio: float = 0.2,
    ent_coef: float = 0.01,
    ppo_epochs: int = 4,
    n_minibatches: int = 4,
    save_path: str = None
) -> list[float]:
    print(" Running PPO Buggy (combined clipping, lr_value=3e-4, n_steps=128)...")
    envs = gym.vector.SyncVectorEnv([lambda: gym.make('CartPole-v1') for _ in range(n_envs)])
    obs, _ = envs.reset(seed=seed)
    state_size: int = envs.single_observation_space.shape[0]
    action_size: int = envs.single_action_space.n
    batch_size = n_envs * n_steps
    minibatch_size = batch_size // n_minibatches
    policy = PolicyNetwork(state_size, action_size).to(device)
    value_net = ValueNetwork(state_size).to(device)
    ortho_init(policy, value_net)
    # Bug 3: Combined optimizer with same low learning rate 3e-4 for critic
    all_params = list(policy.parameters()) + list(value_net.parameters())
    optimizer = optim.Adam(all_params, lr=lr, eps=1e-5)
    running_rewards = np.zeros(n_envs)
    completed_rewards: list[float] = []
    while len(completed_rewards) < episodes:
        mb_obs = t.zeros(n_steps, n_envs, state_size, device=device)
        mb_actions = t.zeros(n_steps, n_envs, dtype=t.long, device=device)
        mb_logp = t.zeros(n_steps, n_envs, device=device)
        mb_rewards = t.zeros(n_steps, n_envs, device=device)
        mb_dones = t.zeros(n_steps, n_envs, device=device)
        mb_values = t.zeros(n_steps, n_envs, device=device)
        with t.no_grad():
            for step in range(n_steps):
                obs_t: Float[Tensor, "env state"] = t.tensor(obs, dtype=t.float32, device=device)
                probs: Float[Tensor, "env action"] = policy(obs_t)
                val: Float[Tensor, "env"] = rearrange(value_net(obs_t), 'env 1 -> env')
                m = t.distributions.Categorical(probs)
                action: Int[Tensor, "env"] = m.sample()
                mb_obs[step] = obs_t
                mb_actions[step] = action
                mb_logp[step] = m.log_prob(action)
                mb_values[step] = val
                obs, reward, terminated, truncated, _ = envs.step(action.cpu().numpy())
                done = terminated | truncated
                mb_rewards[step] = t.tensor(reward, dtype=t.float32, device=device)
                mb_dones[step] = t.tensor(done, dtype=t.float32, device=device)
                running_rewards += reward
                done_mask = done.astype(bool)
                if done_mask.any():
                    completed_rewards.extend(running_rewards[done_mask].tolist())
                    running_rewards[done_mask] = 0.0
            next_obs: Float[Tensor, "env state"] = t.tensor(obs, dtype=t.float32, device=device)
            next_val: Float[Tensor, "env"] = rearrange(value_net(next_obs), 'env 1 -> env')
        advantages, returns = compute_gae(mb_rewards, mb_values, mb_dones, next_val, gamma, lmbda)
        flat_obs: Float[Tensor, "batch state"] = rearrange(mb_obs, 'step env state -> (step env) state')
        flat_act: Int[Tensor, "batch"] = rearrange(mb_actions, 'step env -> (step env)')
        flat_logp: Float[Tensor, "batch"] = rearrange(mb_logp, 'step env -> (step env)')
        flat_ret: Float[Tensor, "batch"] = rearrange(returns, 'step env -> (step env)')
        flat_adv: Float[Tensor, "batch"] = rearrange(advantages, 'step env -> (step env)')
        flat_adv = (flat_adv - flat_adv.mean()) / (flat_adv.std() + 1e-8)
        for _ in range(ppo_epochs):
            indices = t.randperm(batch_size, device=device)
            for start in range(0, batch_size, minibatch_size):
                idx = indices[start : start + minibatch_size]
                mb_probs: Float[Tensor, "mini action"] = policy(flat_obs[idx])
                mb_m = t.distributions.Categorical(mb_probs)
                new_lp: Float[Tensor, "mini"] = mb_m.log_prob(flat_act[idx])
                new_val: Float[Tensor, "mini"] = rearrange(value_net(flat_obs[idx]), 'mini 1 -> mini')
                ratio: Float[Tensor, "mini"] = t.exp(new_lp - flat_logp[idx])
                adv = flat_adv[idx]
                if use_clip:
                    surr1: Float[Tensor, "mini"] = ratio * adv
                    surr2: Float[Tensor, "mini"] = t.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * adv
                    p_loss: Float[Tensor, ""] = -reduce(t.min(surr1, surr2), 'mini -> ()', 'mean')
                else:
                    p_loss: Float[Tensor, ""] = -reduce(ratio * adv, 'mini -> ()', 'mean')
                v_loss: Float[Tensor, ""] = 0.5 * F.mse_loss(new_val, flat_ret[idx])
                entropy: Float[Tensor, ""] = mb_m.entropy().mean()
                loss: Float[Tensor, ""] = p_loss + v_loss - ent_coef * entropy
                optimizer.zero_grad()
                loss.backward()
                # Bug 1: Combined gradient clipping
                nn.utils.clip_grad_norm_(all_params, 0.5)
                optimizer.step()
    envs.close()
    if save_path:
        os.makedirs(os.path.dirname(save_path), exist_ok=True)
        t.save(policy.state_dict(), save_path)
    return completed_rewards[:episodes]
```
Visualizing Performance: PPO Buggy (Failed)
As expected, the buggy version fails to learn, resulting in a policy that collapses and tips the pole within 64 steps in evaluation.

Step 6: Empirical Results & Final Comparisons
After addressing the bugs (applying independent gradient clipping, raising the value network learning rate to 0.01, and lowering rollout steps to 32), I reran the experiment over 5 independent random seeds per algorithm to evaluate training stability and capture the learning variance.
Here is the learning curve showing the moving average reward (solid line) and the standard deviation range (shaded region) across all five variations over 1,000 episodes:
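For reference, a curve of this kind can be computed along the following lines. This is a sketch in plain Python under my own assumptions: a trailing moving average whose window is a free parameter, and a population standard deviation taken across seeds at each episode index. The post does not state which window or estimator produced the figure.

```python
import statistics


def moving_average(values: list[float], window: int) -> list[float]:
    """Trailing moving average; the first entries average over fewer points."""
    out: list[float] = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        out.append(running / min(i + 1, window))
    return out


def mean_and_std_across_seeds(curves: list[list[float]]) -> tuple[list[float], list[float]]:
    """curves: one list of per-episode values per seed, all of equal length."""
    means = [statistics.fmean(step) for step in zip(*curves)]
    stds = [statistics.pstdev(step) for step in zip(*curves)]
    return means, stds


if __name__ == "__main__":
    print(moving_average([1.0, 2.0, 3.0, 4.0], window=2))  # [1.0, 1.5, 2.5, 3.5]
    print(mean_and_std_across_seeds([[1.0, 3.0], [3.0, 5.0]]))
```

Smoothing each seed first and then aggregating keeps the shaded band a measure of seed-to-seed spread rather than episode-to-episode noise.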

Summary of Results:
- PPO Standard (Fixed): Extremely stable and consistent. It converges tightly to the maximum reward of 500.0 within ~530 episodes with very little variance between seeds.
- A2C + GAE (Fixed): Also converges quickly to 500.0 reward within ~480 episodes, showing slightly wider variance bounds early on but stabilizing quickly.
- REINFORCE (With Baseline): Learns steadily, but displays moderate variance between runs and takes longer to stabilize due to the inherent noise of Monte Carlo return rollouts.
- REINFORCE (No Baseline): Exhibits the highest variance across seeds and slowest overall convergence, highlighting the necessity of baselines.
- PPO Buggy (Failed): Flat-lines completely at the random baseline reward of ~20.0 with virtually zero variance, confirming it fails to learn under all seeds.
Conclusion
This investigation highlighted that in deep RL, implementation details are just as important as the mathematical formulation. A single combined gradient clipping line or a mismatched rollout size can completely freeze learning. Separating policy/value parameters, tuning value function optimizers to match prediction targets, and tracking update frequencies relative to completed episodes are critical checks when scaling from basic Monte Carlo baselines to vectorized step-based policy gradient methods.

