PYTHON · NUMPY · IN-BROWSER

Reinforcement Learning

Define a reward function and train an RL policy in-sim to convergence (smoothed reward >= 6.0) with a greedy rollout that reaches the goal safely and efficiently.

01 Challenge

Same rover, new course C2 (open arena, one goal pad, one hazard). There are no labels and no expert — you're handed a training loop already wired up. It calls one function you must write: reward(obs, action, next_obs). Make the rover reliably reach the goal by writing only the reward. The trap: the obvious sparse reward (+1 at goal, 0 otherwise) leaves the curve flat near zero — the policy almost never stumbles onto the goal by chance. Reward design, not the algorithm, is the lever.

02 Model

Last lesson you gave the rover answers in the form of labeled examples. But what if all you can give it is a score, where higher is better, and you let it figure out the rest? Think of the reward as a landscape and learning as climbing it. A reward that's flat everywhere except for one pinprick at the goal leaves the policy blind: it wanders, never feeling which way is up.

Shape the landscape so getting closer already pays a little, and now there's a slope to climb. Episode by episode the policy nudges toward actions that scored well; watch the smoothed reward rise and cross the line — that's convergence. We never told it the path; we shaped the incentive, and the path fell out.

A reward shapes the policy until it converges.
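
To make the landscape picture concrete, here is a minimal sketch that scores the same straight-line approach under both rewards. The path (a made-up list of goal_dist values) and the helper names make_obs, sparse_reward, and progress_reward are illustrative assumptions, not part of the provided harness; the 2.0 scale and the reward(obs, action, next_obs) signature follow the lesson.

```python
# Hypothetical straight-line approach: goal_dist before the first step
# and after each step (made-up numbers, for illustration only).
dists = [4.0, 3.0, 2.0, 1.0, 0.0]

def make_obs(d):
    # Assumed minimal observation: only the keys this sketch reads.
    return {"goal_dist": d, "reached": d == 0.0}

def sparse_reward(obs, action, next_obs):
    # +1 only on arrival, 0 everywhere else: no slope to climb.
    return 1.0 if next_obs["reached"] else 0.0

def progress_reward(obs, action, next_obs):
    # Pays for every bit of distance closed, so each step carries signal.
    return 2.0 * (obs["goal_dist"] - next_obs["goal_dist"])

steps = list(zip(dists[:-1], dists[1:]))
sparse = [sparse_reward(make_obs(a), 0, make_obs(b)) for a, b in steps]
shaped = [progress_reward(make_obs(a), 0, make_obs(b)) for a, b in steps]

print("sparse per step:", sparse)
print("shaped per step:", shaped)
```

Until the rover actually arrives, the sparse reward returns nothing but zeros, while the progress term reports after every step whether the rover moved the right way.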

03 Guided practice

Step 1 (worked): run the harness with the sparse reward and plot the flat curve.
Step 2 (worked): switch to the provided potential-based progress term, 2.0 * (decrease in goal_dist over the step), plus the terminal bonus.
Step 3 (faded): add the hazard penalty (it must exceed the progress gained by cutting the corner) and a small per-step time penalty.
Step 4 (independent): train to convergence and evaluate the greedy rollout.
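
One way to size the hazard penalty in Step 3 is to add up the shaped reward along a safe route and along a corner-cutting route, then see which one the reward prefers. The sketch below does that under stated assumptions: both routes, the helper episode_return, and the 0.05 time penalty are hypothetical, and only the 2.0 progress scale and the 10.0 terminal bonus come from the lesson's reward.

```python
def episode_return(dists, in_hazard, hazard_penalty, time_penalty):
    """Sum the shaped reward over one path.

    dists: goal_dist before the first step and after every step.
    in_hazard: one flag per step, True if that step ends inside the hazard.
    """
    total = 0.0
    for d_prev, d_next, hz in zip(dists[:-1], dists[1:], in_hazard):
        total += 2.0 * (d_prev - d_next)   # progress term
        if d_next == 0.0:
            total += 10.0                  # terminal bonus on arrival
        if hz:
            total -= hazard_penalty
        total -= time_penalty
    return total

# Two made-up routes to the same goal (illustrative numbers only).
detour = dict(dists=[5.0, 4.5, 4.5, 4.0, 3.0, 2.0, 1.0, 0.0],
              in_hazard=[False] * 7)
cut = dict(dists=[5.0, 4.0, 3.0, 2.0, 1.0, 0.0],
           in_hazard=[False, True, True, False, False])

for penalty in (0.0, 1.0):
    safe = episode_return(**detour, hazard_penalty=penalty, time_penalty=0.05)
    risky = episode_return(**cut, hazard_penalty=penalty, time_penalty=0.05)
    print(f"penalty={penalty}: detour={safe:.2f}, cut={risky:.2f}")
```

With the penalty at 0.0 the shortcut scores higher (19.75 against 19.65), so nothing discourages clipping the zone; at 1.0 the detour wins (19.65 against 17.75). In this toy case the progress terms sum to the same value on both routes, so the shortcut's whole advantage is the two steps of time penalty it saves. The real arena may differ, so treat the numbers as a way to reason about magnitudes, not as tuned values.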

04 Feedback

PASS WHEN: smoothed final reward >= 6.0 and non-decreasing over the last 100 episodes, greedy rollout reaches the goal with no hazard entry in <= 220 steps, and training re-converges at seed 441.
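
The convergence part of the pass condition can be sketched as a small check on the reward curve. This is one plausible reading, assuming a 20-episode moving average and an end-point comparison over the last 100 smoothed values; the function name converged and the synthetic curves are illustrative assumptions.

```python
import numpy as np

def smoothed(curve, w=20):
    # Moving average over w episodes; mode="valid" drops partial windows.
    return np.convolve(np.asarray(curve, dtype=float), np.ones(w) / w, mode="valid")

def converged(curve, threshold=6.0, tail=100):
    sm = smoothed(curve)
    start = sm[-min(tail, len(sm))]   # smoothed value about `tail` episodes back
    # High enough at the end, and no lower than where the tail started.
    return bool(sm[-1] >= threshold and sm[-1] >= start)

rising = np.linspace(0.0, 8.0, 600)   # synthetic curve that climbs past 6.0
flat = np.zeros(600)                  # synthetic sparse-reward curve
print(converged(rising), converged(flat))
```

Comparing only the two end points is looser than requiring every step of the tail to be non-decreasing; a stricter variant could test np.all(np.diff(sm[-tail:]) >= 0) instead.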

FAIL: final reward low and curve flat. The reward is sparse; add a dense per-step term for closing goal_dist (potential-based shaping).
FAIL: converged but collided in hazard. The hazard penalty is smaller than the progress gained by cutting the corner; raise the penalty above the progress gained by clipping the zone.
FAIL: reward high but reached=False. The shaping rewards moving, not arriving; keep the terminal bonus and confirm progress = obs["goal_dist"] - next_obs["goal_dist"], which is positive when the rover gets closer.
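
A quick sign check catches the third failure before any training run. The sketch below restates the shaped reward with both penalties left at 0.0 and asserts that approaching pays, retreating costs, and arriving adds the bonus; make_obs and the distances are made-up test fixtures, not harness code.

```python
def reward(obs, action, next_obs):
    # Shaped reward in the lesson's form, penalties omitted for the check.
    progress = obs["goal_dist"] - next_obs["goal_dist"]   # > 0 when closer
    r = 2.0 * progress
    if next_obs["reached"]:
        r += 10.0
    return r

def make_obs(d, reached=False):
    # Assumed minimal observation: only the keys the reward reads.
    return {"goal_dist": d, "reached": reached}

closer = reward(make_obs(3.0), 0, make_obs(2.5))
away = reward(make_obs(2.5), 0, make_obs(3.0))
arrive = reward(make_obs(0.5), 0, make_obs(0.0, reached=True))

assert closer > 0, "moving toward the goal must pay"
assert away < 0, "moving away must cost"
assert arrive > closer, "arriving must beat merely approaching"
print(closer, away, arrive)
```

If the subtraction is flipped, the first assertion fails immediately, which is much cheaper than discovering the problem from a curve that rewards driving away from the goal.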

05 Retrieve & space

From 3.1: what does RL have that supervised classification did not, that let it learn with no labels? (A reward signal / trial-and-error.)
From 2.2/2.3: which is easier to guarantee never enters the hazard, the FSM rule or the learned policy, and why? (Rules give guarantees; learned policies give statistics.)
From 2.1: re-train with heading_err removed from the observation — what happens to convergence and why? (The policy can only learn from what's in its state.)

06 Mastery & project

Your reward drives the RL loop to a smoothed reward >= 6.0 that is non-decreasing over the last 100 episodes; the greedy rollout reaches the goal with no hazard entry in <= 220 steps; and training re-converges at seed 441 (L3: design a reward for a safe, efficient policy and repair a degenerate curve).

Feeds the capstone (5.1) as the learned navigation component integrated beneath the FSM; in 5.2 its inference loop is a candidate to push to the metal if profiling flags it.


1. Tiny RL harness (provided) + your sparse reward (flat curve)

    [...] 220
            o["collided"] = o["in_hazard"]
            return o

    def wrap_(a): return (a + np.pi) % (2*np.pi) - np.pi

    def reward(obs, action, next_obs):
        if next_obs["reached"]: return 1.0
        return 0.0  # SPARSE: curve will stay flat near zero
    print("harness ready; sparse reward defined (it will fail by design).")

2. Shaped reward (progress + your safety/time terms)

    # Dense progress signal + terminal bonus; YOU add hazard + time penalties.
    HAZARD_PENALTY = 0.0  # <-- TUNE: must exceed the progress gained by clipping the corner
    TIME_PENALTY = 0.0    # <-- TUNE: small constant, e.g. 0.01, so it doesn't loiter

    def reward(obs, action, next_obs):
        progress = obs["goal_dist"] - next_obs["goal_dist"]  # >0 when getting closer
        r = 2.0 * progress
        if next_obs["reached"]:
            r += 10.0
        if next_obs["in_hazard"]:
            r -= HAZARD_PENALTY
        r -= TIME_PENALTY
        return r
    print("shaped reward set. hazard_penalty =", HAZARD_PENALTY, " time_penalty =", TIME_PENALTY)

3. Train to convergence (deterministic policy-gradient-lite)

    # A compact, seedable trainer: a softmax policy over 4 actions conditioned on a
    # coarse state bin; climbs the reward you designed. Deterministic for grading.
    import numpy as np

    def train(reward_fn, episodes=600, seed=440):
        r = np.random.default_rng(seed)
        theta = np.zeros((6, 4))  # 6 state bins x 4 actions
        def feat(o):
            b = min(5, int(o["goal_dist"]))
            return b
        curve = []
        for ep in range(episodes):
            env = Arena(seed + (ep % 3)); o = env.reset()
            total = 0.0; grads = []
            for t in range(221):
                b = feat(o); z = theta[b] - theta[b].max(); p = np.exp(z); p /= p.sum()
                a = int(r.choice(4, p=p))
                no = env.step(a); rw = reward_fn(o, a, no); total += rw
                g = -p; g[a] += 1.0; grads.append((b, g, rw))
                o = no
                if o["done"]: break
            for (b, g, rw) in grads:
                theta[b] += 0.02 * g * (total / 50.0)  # crude REINFORCE update
            curve.append(total)
        train.policy = theta
        return curve

    def smoothed(c, w=20):
        c = np.array(c); k = np.ones(w)/w
        return np.convolve(c, k, mode='valid')

    curve = train(reward, 600, 440)
    final = float(smoothed(curve)[-1])
    print("final smoothed reward:", round(final, 2))

4. Autograder (PASS = reward>=6, reach, safe, efficient, seed 441)

    import numpy as np
    def rollout(theta, seed):
        env = Arena(seed); o = env.reset()
        def feat(o): return min(5, int(o["goal_dist"]))
        steps = 0
        for t in range(221):
            a = int(np.argmax(theta[feat(o)])); o = env.step(a); steps += 1
            if o["done"]: break
        return o.get("reached", False), o.get("collided", False), steps

    sm = smoothed(curve); slope = sm[-1] - sm[-min(100, len(sm))]
    reached, collided, steps = rollout(train.policy, 440)
    curve2 = train(reward, 600, 441); final2 = float(smoothed(curve2)[-1])
    if final < 6.0 or slope < 0:
        print(f"FAIL: final smoothed reward {final:.1f} (need >=6.0, non-decreasing). Reward too "
              f"sparse -> add a dense per-step term for closing goal_dist (potential shaping).")
    elif collided:
        print("FAIL: converged but collided in hazard - HAZARD_PENALTY is smaller than the "
              "progress gained by clipping the zone; raise it above that progress.")
    elif not reached:
        print("FAIL: reward high but reached=False - shaping rewards moving, not arriving; "
              "keep the terminal bonus and confirm progress is obs['goal_dist'] - next_obs['goal_dist'].")
    elif steps > 220:
        print(f"FAIL: reached but steps {steps} (>220) - no time penalty; subtract a small "
              f"constant each step (~0.01-0.05).")
    elif final2 < 6.0:
        print(f"FAIL: seed 440 passed but seed 441 reward {final2:.1f} - magnitudes tuned to one "
              f"init; prefer potential-based shaping invariant to seed.")
    else:
        print(f"PASS: converged to {final:.1f}, reached goal safely in {steps} steps, "
              f"re-converges at seed 441 ({final2:.1f}).")

Write the reward function (progress + safety + time) so the provided RL loop converges.