
Reinforcement Learning

Reinforcement learning (RL) is the study of how an agent learns to make sequential decisions by interacting with an environment in order to maximize cumulative reward (Sutton & Barto, 2018). Unlike supervised learning, the agent receives no labeled targets — only scalar reward signals that may be delayed and sparse. This chapter develops the probabilistic foundations of RL, derives the key policy gradient algorithms, and examines the role of RL in modern language model training.

Source
import torch
import numpy as np
import matplotlib.pyplot as plt
from torch import nn
from torch.distributions import Categorical, Beta

Markov Decision Processes

The standard framework for sequential decision-making is the Markov decision process (MDP), defined by a tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$:

  • $\mathcal{S}$: state space (e.g. positions on a grid, token sequences).

  • $\mathcal{A}$: action space (e.g. moves, next tokens).

  • $p(s' \mid s, a)$: transition distribution — the probability of reaching state $s'$ from state $s$ after taking action $a$.

  • $r(s, a) \in \mathbb{R}$: reward function — the immediate signal received after taking action $a$ in state $s$.

  • $\gamma \in [0, 1)$: discount factor — how much future rewards are down-weighted relative to immediate ones.

A policy $\pi(a \mid s)$ is a distribution over actions given the current state. Starting from $s_0$, the agent follows the trajectory

$$s_0 \xrightarrow{a_0 \sim \pi} s_1 \xrightarrow{a_1 \sim \pi} s_2 \xrightarrow{a_2 \sim \pi} \cdots$$

where $s_{t+1} \sim p(\cdot \mid s_t, a_t)$ and reward $r_t = r(s_t, a_t)$ is collected at each step. The return from time $t$ is the discounted sum of future rewards:

$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}.$$

The objective is to find a policy that maximizes expected return from the initial state:

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}[G_0] = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{T} \gamma^t r_t\right],$$

where $\tau = (s_0, a_0, s_1, a_1, \ldots)$ denotes a trajectory sampled under $\pi$.
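
To make these definitions concrete, here is a minimal sketch: a hypothetical three-state, two-action MDP whose transition and reward numbers are purely illustrative. It rolls out a fixed policy and computes the discounted return $G_0$ from the sampled rewards.

import torch

torch.manual_seed(0)

# A tiny illustrative MDP: 3 states, 2 actions (all numbers are made up).
# P[s, a] is the transition distribution p(. | s, a); R[s, a] is the reward r(s, a).
P = torch.tensor([[[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]],
                  [[0.5, 0.4, 0.1], [0.0, 0.2, 0.8]],
                  [[0.0, 0.1, 0.9], [0.3, 0.3, 0.4]]])
R = torch.tensor([[0.0, 0.1],
                  [0.2, 0.5],
                  [1.0, 0.0]])
gamma = 0.9

def rollout(policy, s0=0, T=50):
    '''Sample one trajectory under `policy` (an [S, A] tensor of action probabilities).'''
    s, rewards = s0, []
    for _ in range(T):
        a = torch.multinomial(policy[s], 1).item()    # a_t ~ pi(. | s_t)
        rewards.append(R[s, a].item())                # r_t = r(s_t, a_t)
        s = torch.multinomial(P[s, a], 1).item()      # s_{t+1} ~ p(. | s_t, a_t)
    return rewards

uniform_policy = torch.full((3, 2), 0.5)
rs = rollout(uniform_policy)
G0 = sum(gamma**t * r for t, r in enumerate(rs))      # discounted return G_0
print(f"G_0 for one rollout: {G0:.3f}")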

Value Functions and the Bellman Equations

Two quantities are central to RL algorithms.

The state-value function $V^\pi(s)$ measures the expected return when starting in state $s$ and following $\pi$ thereafter:

$$V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\!\left[G_t \mid s_t = s\right] = \mathbb{E}_{a \sim \pi(\cdot \mid s),\, s' \sim p(\cdot \mid s,a)}\!\left[r(s, a) + \gamma V^\pi(s')\right].$$

The action-value function $Q^\pi(s, a)$ conditions on both state and action:

$$Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\!\left[G_t \mid s_t = s, a_t = a\right] = r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s, a)}\!\left[V^\pi(s')\right].$$

Both satisfy Bellman equations — recursive self-consistency conditions. For the optimal policy $\pi^*$, these become the Bellman optimality equations:

$$V^*(s) = \max_a \left[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^*(s') \right], \qquad Q^*(s, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} Q^*(s', a').$$

The advantage function measures how much better action $a$ is than the average under $\pi$:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s).$$

When $A^\pi(s, a) > 0$, action $a$ is better than average in state $s$ under $\pi$.
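
The Bellman optimality equations suggest a simple fixed-point algorithm. As a sketch, the value iteration below reuses the toy MDP tensors (P, R, gamma) from the previous cell, repeatedly applies the optimality backup, and then reads off $Q^*$, the greedy policy, and the advantages.

# Value iteration on the toy MDP above: iterate the Bellman optimality
# backup until convergence, then read off Q*, the greedy policy, and A*.
V = torch.zeros(3)
for _ in range(500):
    Q = R + gamma * (P @ V)                 # Q[s, a] = r(s, a) + gamma * sum_s' p(s'|s,a) V(s')
    V_new = Q.max(dim=1).values             # V*(s) = max_a Q*(s, a)
    if torch.allclose(V_new, V, atol=1e-8):
        break
    V = V_new

pi_star = Q.argmax(dim=1)                   # greedy policy w.r.t. Q*
A = Q - V.unsqueeze(1)                      # advantage A*(s, a) = Q*(s, a) - V*(s)
print("V* =", V.numpy().round(3), " greedy actions =", pi_star.numpy())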

Policy Gradient Methods

Policy gradient methods directly parameterize the policy $\pi_\theta(a \mid s)$ and optimize $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[G_0]$ by gradient ascent. The key result is the policy gradient theorem (Sutton & Barto, 2018):

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot G_t\right].$$

This has an elegant derivation. The probability of a trajectory under $\pi_\theta$ is

$$p_\theta(\tau) = p(s_0) \prod_{t=0}^T \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t),$$

and since the transition probabilities do not depend on $\theta$,

$$\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t \mid s_t).$$

Applying the log-derivative trick $\nabla_\theta \mathbb{E}[f] = \mathbb{E}[f\, \nabla_\theta \log p_\theta]$ then yields the theorem.
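
The identity can be checked numerically. The sketch below uses an arbitrary four-way categorical "policy" with a fabricated per-action return vector f, and compares the exact gradient of $\mathbb{E}[f]$ with the score-function Monte Carlo estimate.

# Numerical check of the log-derivative (score-function) identity:
# grad_theta E_{a ~ pi_theta}[f(a)]  vs  the MC estimate E[f(a) grad_theta log pi_theta(a)].
from torch.distributions import Categorical

theta = torch.randn(4, requires_grad=True)
f = torch.tensor([1.0, -2.0, 0.5, 3.0])     # arbitrary per-action "return"

# Exact gradient: E[f] = sum_a softmax(theta)_a * f_a is differentiable in theta.
exact = torch.autograd.grad((torch.softmax(theta, 0) * f).sum(), theta)[0]

# Score-function Monte Carlo estimate.
dist = Categorical(logits=theta)
a = dist.sample((200_000,))
surrogate = (f[a] * dist.log_prob(a)).mean()    # surrogate whose gradient is the estimator
mc = torch.autograd.grad(surrogate, theta)[0]

print("exact gradient:    ", exact.detach().numpy().round(3))
print("score-function MC: ", mc.detach().numpy().round(3))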

REINFORCE

The simplest policy gradient algorithm is REINFORCE (Williams, 1992). Given a batch of trajectories $\{\tau^{(i)}\}$ sampled from $\pi_\theta$, the gradient estimate is:

$$\hat{\nabla}_\theta J(\theta) = \frac{1}{B}\sum_{i=1}^B \sum_{t=0}^{T_i} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)}) \cdot G_t^{(i)}.$$

This estimator is unbiased but can have high variance. A common variance-reduction technique subtracts a baseline $b(s_t)$ — typically an estimate of $V^\pi(s_t)$ — from the return:

$$\hat{\nabla}_\theta J(\theta) = \frac{1}{B}\sum_{i=1}^B \sum_{t=0}^{T_i} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)}) \cdot \left(G_t^{(i)} - b(s_t^{(i)})\right).$$

Subtracting the baseline does not introduce bias (the baseline term has zero expectation under the policy) but can dramatically reduce variance.

Source
# Multi-armed bandit: compare REINFORCE with Thompson sampling
# K-armed Bernoulli bandit

torch.manual_seed(42)
np.random.seed(42)

K = 5  # number of arms
true_probs = torch.tensor([0.1, 0.3, 0.5, 0.7, 0.9])  # true reward probs
T = 500  # time horizon

# ── Thompson sampling (Bayesian baseline) ──────────────────────────────────────
def thompson_sampling(K, true_probs, T):
    '''Beta-Bernoulli Thompson sampling on a K-armed bandit.'''
    alpha = torch.ones(K)  # Beta(alpha, beta) prior
    beta  = torch.ones(K)
    rewards = []
    for _ in range(T):
        # Sample from posterior
        theta = torch.distributions.Beta(alpha, beta).sample()
        arm = theta.argmax().item()
        reward = (torch.rand(1).item() < true_probs[arm].item())
        rewards.append(float(reward))
        if reward:
            alpha[arm] += 1
        else:
            beta[arm] += 1
    return rewards

# ── REINFORCE on bandit (stateless MDP, one step per episode) ──────────────────
def reinforce_bandit(K, true_probs, T, lr=0.05):
    '''REINFORCE for K-armed bandit: softmax policy, baseline = running mean.'''
    logits = torch.zeros(K, requires_grad=True)
    optimizer = torch.optim.Adam([logits], lr=lr)
    rewards = []
    baseline = 0.0
    alpha_ema = 0.05  # EMA coefficient for baseline
    for t in range(T):
        dist = Categorical(logits=logits)
        arm = dist.sample()
        reward = float(torch.rand(1).item() < true_probs[arm].item())
        rewards.append(reward)
        baseline = (1 - alpha_ema) * baseline + alpha_ema * reward
        loss = -dist.log_prob(arm) * (reward - baseline)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return rewards

def running_mean(x, w=20):
    return np.convolve(x, np.ones(w)/w, mode='valid')

ts_rewards    = thompson_sampling(K, true_probs, T)
rf_rewards    = reinforce_bandit(K, true_probs, T)

fig, axes = plt.subplots(1, 2, figsize=(11, 4))

# Left: cumulative regret
optimal = true_probs.max().item()
ts_regret = np.cumsum(optimal - np.array(ts_rewards))
rf_regret = np.cumsum(optimal - np.array(rf_rewards))
axes[0].plot(ts_regret, label='Thompson sampling', color='tab:blue')
axes[0].plot(rf_regret, label='REINFORCE', color='tab:orange')
axes[0].set_xlabel('Step'); axes[0].set_ylabel('Cumulative regret')
axes[0].set_title('Cumulative regret on Bernoulli bandit')
axes[0].legend()

# Right: smoothed per-step reward
axes[1].plot(running_mean(ts_rewards), label='Thompson sampling', color='tab:blue')
axes[1].plot(running_mean(rf_rewards), label='REINFORCE', color='tab:orange')
axes[1].axhline(optimal, ls='--', color='k', lw=1, label='Optimal')
axes[1].set_xlabel('Step'); axes[1].set_ylabel('Reward (smoothed)')
axes[1].set_title('Per-step reward (20-step moving average)')
axes[1].legend()

plt.tight_layout()
plt.savefig('bandit_rl.png', dpi=100, bbox_inches='tight')
plt.show()

Actor-Critic Methods and PPO

REINFORCE uses the full Monte Carlo return $G_t$, which is unbiased but high-variance. Actor-critic methods replace $G_t$ with a lower-variance estimate by maintaining a learned critic $\hat{V}_\phi(s)$ alongside the actor $\pi_\theta$. The actor is updated using the advantage estimate

$$\hat{A}_t = r_t + \gamma\, \hat{V}_\phi(s_{t+1}) - \hat{V}_\phi(s_t),$$

which is an empirical version of the Bellman residual (also called the TD error). The critic is updated to minimize $(\hat{V}_\phi(s_t) - G_t)^2$.
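
A minimal sketch of one such update is shown below, assuming small illustrative actor and critic networks and a single fabricated transition; for brevity, the critic here regresses onto the one-step TD target rather than the Monte Carlo return $G_t$.

# A minimal sketch of one actor-critic update from a single transition (s, a, r, s').
# The network sizes, learning rate, and transition values are illustrative.
import torch
from torch import nn
from torch.distributions import Categorical

obs_dim, n_actions, discount = 4, 2, 0.99
actor  = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=3e-4)

def update(s, a, r, s_next, done):
    '''One actor-critic step: TD-error advantage for the actor, TD target for the critic.'''
    v, v_next = critic(s).squeeze(-1), critic(s_next).squeeze(-1)
    td_target = r + discount * (1.0 - done) * v_next.detach()
    advantage = td_target - v.detach()                    # A_hat = r + gamma V(s') - V(s)
    dist = Categorical(logits=actor(s))
    actor_loss  = -(dist.log_prob(a) * advantage).mean()  # policy gradient with advantage
    critic_loss = (v - td_target).pow(2).mean()           # squared TD / Bellman residual
    opt.zero_grad(); (actor_loss + critic_loss).backward(); opt.step()

# Example call on a fabricated transition:
update(torch.randn(1, obs_dim), torch.tensor([1]), torch.tensor([0.5]),
       torch.randn(1, obs_dim), torch.tensor([0.0]))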

Proximal Policy Optimization (PPO)

A key practical challenge in policy gradient methods is choosing a step size: too large a step can collapse the policy, and the resulting poor samples make recovery slow. Trust region methods address this by constraining each update to stay close to the current policy.

PPO (Schulman et al., 2017) enforces a soft trust region via a clipped surrogate objective. Define the probability ratio

$$\rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}.$$

The PPO objective clips this ratio to prevent large updates:

$$\mathcal{L}^{\text{PPO}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(\rho_t(\theta)\, \hat{A}_t,\; \mathrm{clip}(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_t\right)\right],$$

where $\epsilon$ is a hyperparameter, typically in the range 0.1 to 0.3. When $\hat{A}_t > 0$ the clip prevents the policy from increasing $\pi_\theta(a_t \mid s_t)$ beyond $(1+\epsilon)\,\pi_{\theta_{\text{old}}}(a_t \mid s_t)$; when $\hat{A}_t < 0$ it prevents a decrease below $(1-\epsilon)\,\pi_{\theta_{\text{old}}}(a_t \mid s_t)$. PPO is simple to implement, stable, and the dominant algorithm in large-scale RL applications.
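
A sketch of the clipped surrogate as a loss function (negated for minimization) follows; the log-probabilities and advantages below are fabricated placeholders rather than outputs of any particular rollout.

# A minimal sketch of the PPO clipped surrogate loss from per-action
# log-probabilities and advantage estimates.
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    '''-E[min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)], negated for minimization.'''
    ratio = torch.exp(logp_new - logp_old)                 # rho_t(theta)
    unclipped = ratio * advantages
    clipped   = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()

# Example: actions with positive advantage whose ratio already exceeds 1 + eps
# contribute no gradient, because the clipped branch is selected by the min.
logp_old = torch.log(torch.tensor([0.2, 0.5, 0.1]))
logp_new = torch.log(torch.tensor([0.3, 0.4, 0.25])).requires_grad_()
adv      = torch.tensor([1.0, -0.5, 2.0])
loss = ppo_clip_loss(logp_new, logp_old, adv)
loss.backward()
print(loss.item(), logp_new.grad)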

Planning as Inference and Maximum Entropy RL

Control as Inference

The probabilistic graphical models we have studied throughout this course suggest a natural reframing of RL: rather than treating reward maximization as an optimization problem, we can treat it as posterior inference in a suitably defined generative model.

Introduce a sequence of binary optimality variables $\mathcal{O}_t \in \{0, 1\}$ with the likelihood:

$$p(\mathcal{O}_t = 1 \mid s_t, a_t) = \exp\!\left(\frac{r(s_t, a_t)}{\alpha}\right),$$

where $\alpha > 0$ is a temperature parameter (this requires $r \leq 0$ so that the expression is a valid probability, or equivalently a suitable shift of the rewards). The full generative model over trajectories and optimality variables is:

$$p(\tau, \mathcal{O}_{1:T}) = p(s_0) \prod_{t=1}^T \left[\pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)\, \exp\!\left(\frac{r(s_t, a_t)}{\alpha}\right)\right].$$

The RL objective — finding a policy that maximizes expected reward — is equivalent to computing the posterior distribution over trajectories given that all optimality variables are 1:

$$p(\tau \mid \mathcal{O}_{1:T} = 1) \propto p(s_0) \prod_t p(s_{t+1} \mid s_t, a_t)\, \exp\!\left(\frac{r(s_t, a_t)}{\alpha}\right).$$

This is analogous to the smoothing distribution in a hidden Markov model, computed by a backward message-passing algorithm. Define the backward messages:

$$\beta_t(s_t) = p(\mathcal{O}_{t:T} = 1 \mid s_t) = \sum_{a_t} \pi(a_t \mid s_t)\, e^{r(s_t, a_t)/\alpha}\, \mathbb{E}_{s_{t+1}}[\beta_{t+1}(s_{t+1})].$$

Taking logarithms and defining the soft value function $V_\mathrm{soft}(s) = \alpha \log \beta(s)$ and soft Q-function $Q_\mathrm{soft}(s, a) = \alpha \log \beta(s, a)$, the backward recursion becomes the soft Bellman equations:

$$Q_\mathrm{soft}(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}}[V_\mathrm{soft}(s_{t+1})],$$
$$V_\mathrm{soft}(s_t) = \alpha \log \sum_{a} \exp\!\left(\frac{Q_\mathrm{soft}(s_t, a)}{\alpha}\right).$$

The second equation replaces the hard $\max$ of standard Bellman optimality with a soft-max (log-sum-exp), recovering the hard max in the limit $\alpha \to 0$.
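
The soft backup is easy to run on the toy MDP from earlier (reusing P, R, and gamma): the sketch below replaces the hard max with $\alpha \log \sum_a \exp(Q/\alpha)$ and shows the soft values approaching the hard optimal values as $\alpha$ shrinks.

# Soft value iteration on the toy MDP above: the hard max in the Bellman backup
# is replaced by alpha * logsumexp(Q / alpha). As alpha -> 0, V_soft -> V*.
V_hard = torch.zeros(3)
for _ in range(500):
    V_hard = (R + gamma * (P @ V_hard)).max(dim=1).values         # hard Bellman backup

for alpha in [1.0, 0.1, 0.01]:
    V_soft = torch.zeros(3)
    for _ in range(500):
        Q_soft = R + gamma * (P @ V_soft)
        V_soft = alpha * torch.logsumexp(Q_soft / alpha, dim=1)   # soft backup (log-sum-exp)
    print(f"alpha={alpha:<5}  V_soft =", V_soft.numpy().round(3))
print("hard max     V*     =", V_hard.numpy().round(3))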

Maximum Entropy RL

The inference perspective has an equivalent variational characterization. Define the maximum entropy RL objective:

$$J_\mathrm{MaxEnt}(\pi) = \mathbb{E}_\pi\!\left[\sum_{t=0}^T \gamma^t \Big(r(s_t, a_t) + \alpha\, \mathcal{H}[\pi(\cdot \mid s_t)]\Big)\right],$$

where $\mathcal{H}[\pi(\cdot \mid s)] = -\sum_a \pi(a \mid s) \log \pi(a \mid s)$ is the policy entropy. The entropy bonus $\alpha\, \mathcal{H}[\pi]$ simultaneously:

  • Encourages exploration: the policy is penalized for being overly deterministic.

  • Improves robustness: the policy learns to hedge against uncertainty.

  • Avoids premature commitment: multiple near-optimal strategies are maintained.

The unique optimal policy for $J_\mathrm{MaxEnt}$ is the Boltzmann / softmax policy:

$$\pi^*(a \mid s) = \frac{\exp\!\left(Q_\mathrm{soft}^*(s, a) / \alpha\right)}{\sum_{a'} \exp\!\left(Q_\mathrm{soft}^*(s, a') / \alpha\right)}.$$

This is a Gibbs distribution over actions, with the soft Q-function playing the role of an energy. As $\alpha \to 0$ the policy concentrates on the greedy action; as $\alpha \to \infty$ it approaches the uniform distribution.

Connection to KL Regularization

The MaxEnt objective can be written equivalently as a KL minimization. For a reference policy $\pi_\mathrm{ref}$, the KL-regularized objective

$$\max_\pi \; \mathbb{E}_\pi[r(s, a)] - \alpha\, D_\mathrm{KL}(\pi \,\|\, \pi_\mathrm{ref})$$

has the same Boltzmann-form optimal solution with $\pi_\mathrm{ref}$ in place of the uniform distribution:

$$\pi^*(a \mid s) \propto \pi_\mathrm{ref}(a \mid s)\, \exp\!\left(\frac{r(s, a)}{\alpha}\right).$$

This is precisely the RLHF objective studied in the next section. The KL penalty to the reference (SFT) language model is the MaxEnt RL entropy bonus, with the entropy measured relative to $\pi_\mathrm{ref}$ rather than the uniform distribution. The closed-form optimal RLHF policy derived below is the Boltzmann policy of this MaxEnt problem.
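
A quick numerical check of this closed form, for a single state with four discrete actions and fabricated rewards and reference probabilities: directly maximizing the KL-regularized objective over the simplex recovers the tilted policy.

# Verify the KL-regularized solution numerically on a discrete toy example:
# the closed-form tilted policy pi_ref * exp(r / alpha) (normalized) should match
# direct optimization of E_pi[r] - alpha * KL(pi || pi_ref). Values are made up.
alpha  = 0.5
r      = torch.tensor([1.0, 0.2, -0.3, 0.8])
pi_ref = torch.tensor([0.4, 0.3, 0.2, 0.1])

pi_closed = pi_ref * torch.exp(r / alpha)
pi_closed = pi_closed / pi_closed.sum()                  # closed-form optimal policy

logits = torch.zeros(4, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.05)
for _ in range(2000):
    pi = torch.softmax(logits, dim=0)
    kl = (pi * (pi / pi_ref).log()).sum()                # KL(pi || pi_ref)
    loss = -(pi * r).sum() + alpha * kl                  # negated objective
    opt.zero_grad(); loss.backward(); opt.step()

print("closed form:", pi_closed.numpy().round(4))
print("optimized:  ", torch.softmax(logits, 0).detach().numpy().round(4))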

Soft Actor-Critic

In practice, Soft Actor-Critic (SAC) implements MaxEnt RL for continuous action spaces by maintaining:

  • An actor $\pi_\theta(a \mid s)$ trained to minimize $D_\mathrm{KL}\!\left(\pi_\theta(\cdot \mid s) \,\Big\|\, \frac{\exp(Q_\phi(s,\cdot)/\alpha)}{Z(s)}\right)$, which simplifies to maximizing $\mathbb{E}_{a \sim \pi_\theta}[Q_\phi(s, a) - \alpha \log \pi_\theta(a \mid s)]$.

  • A critic $Q_\phi(s, a)$ trained on the soft Bellman residual, using a target network for stability.

  • An optional automatic entropy tuning step that adjusts $\alpha$ to maintain a target entropy $\mathcal{H}^*$.

SAC is the dominant off-policy MaxEnt RL algorithm for continuous control, achieving strong sample efficiency through experience replay and the stabilizing effect of the entropy bonus.
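
As a rough sketch of the actor update only (a diagonal Gaussian policy without the usual tanh squashing, a fixed $\alpha$, and illustrative network shapes; not a full SAC implementation):

# Simplified SAC actor objective: maximize E[Q(s, a) - alpha * log pi(a | s)]
# with a reparameterized Gaussian policy. All shapes and modules are illustrative.
import torch
from torch import nn

obs_dim, act_dim, alpha = 8, 2, 0.2

class GaussianActor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * act_dim))
    def forward(self, s):
        mu, log_std = self.net(s).chunk(2, dim=-1)
        return torch.distributions.Normal(mu, log_std.clamp(-5, 2).exp())

actor  = GaussianActor()
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

def sac_actor_loss(states):
    dist = actor(states)
    a = dist.rsample()                                    # reparameterized action sample
    log_pi = dist.log_prob(a).sum(-1)                     # log pi_theta(a | s)
    q = critic(torch.cat([states, a], dim=-1)).squeeze(-1)
    return (alpha * log_pi - q).mean()                    # negation of E[Q - alpha log pi]

loss = sac_actor_loss(torch.randn(32, obs_dim))
loss.backward()   # in practice, only the actor parameters are stepped with this loss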

Reinforcement Learning from Human Feedback

The Language Model as a Policy

A language model $\pi_\theta(y \mid x)$ can be viewed as a policy in an MDP where:

  • the state is the prompt $x$ (and tokens generated so far),

  • the action is the next token $a \in \mathcal{V}$ (vocabulary),

  • the episode ends at an end-of-sequence token, yielding a full response $y$.

The transition dynamics are deterministic (the generated token becomes part of the state), so the stochasticity in the MDP comes entirely from $\pi_\theta$.

Reward Modeling from Human Preferences

Many desirable properties of language model outputs — helpfulness, harmlessness, honesty — are difficult to specify as an explicit reward function. RLHF (Christiano et al., 2017; Ouyang et al., 2022) sidesteps this by learning a reward model from human preference data.

Step 1: Collect preference data. Present human annotators with pairs of responses $(y^+, y^-)$ to the same prompt $x$, and ask which is preferred. The Bradley–Terry model for pairwise preferences gives:

$$p(y^+ \succ y^- \mid x) = \sigma\!\left(r_\phi(x, y^+) - r_\phi(x, y^-)\right),$$

where $r_\phi(x, y) \in \mathbb{R}$ is a learned reward model and $\sigma$ is the sigmoid. The reward model is trained by maximum likelihood on the preference dataset $\mathcal{D} = \{(x, y^+, y^-)\}$:

$$\mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x,\, y^+,\, y^-) \sim \mathcal{D}}\!\left[\log \sigma\!\left(r_\phi(x, y^+) - r_\phi(x, y^-)\right)\right].$$
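
A minimal sketch of this loss, given scalar reward-model scores for the chosen and rejected responses (the scores below are fabricated placeholders, not outputs of a specific reward model):

# Bradley-Terry reward-model loss: -E[log sigmoid(r_phi(x, y+) - r_phi(x, y-))].
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    '''Pairwise preference loss for a batch of (chosen, rejected) reward scores.'''
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Example with fabricated reward-model outputs for 4 preference pairs:
r_chosen   = torch.tensor([1.2, 0.3, -0.1, 2.0])
r_rejected = torch.tensor([0.5, 0.6, -0.4, 1.0])
print(reward_model_loss(r_chosen, r_rejected))   # lower when r+ > r- consistently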

Step 2: Fine-tune with RL. Using $r_\phi$ as the reward signal, fine-tune the language model $\pi_\theta$ with PPO to maximize expected reward (Ziegler et al., 2019). To prevent the policy from straying too far from the supervised fine-tuned (SFT) reference model $\pi_{\text{ref}}$, a KL penalty is added:

$$\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[r_\phi(x, y) - \beta\, D_{\text{KL}}\!\left(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\right)\right],$$

where $\beta > 0$ controls the strength of the KL penalty. The KL term serves two purposes: it prevents reward hacking (exploiting the reward model), and it ensures the language model retains its general capabilities.

The KL-penalized objective has a closed-form optimal solution. Let $Z(x) = \sum_y \pi_{\text{ref}}(y \mid x)\, e^{r_\phi(x,y)/\beta}$ be the partition function. Then:

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x)\, \exp\!\left(\frac{r_\phi(x, y)}{\beta}\right).$$

This is a tilted or exponentially weighted version of the reference policy — a familiar form from the importance sampling and energy-based model literatures.

Direct Preference Optimization

Direct Preference Optimization (DPO) (Rafailov et al., 2024) observes that, given the closed-form optimal policy above, the reward model $r_\phi$ can be expressed directly in terms of the language model:

$$r_\phi(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x).$$

Substituting into the Bradley–Terry preference model and noting that $\log Z(x)$ cancels in the difference $r(x, y^+) - r(x, y^-)$, the preference probability becomes:

$$p(y^+ \succ y^- \mid x) = \sigma\!\left(\beta \log \frac{\pi_\theta(y^+ \mid x)}{\pi_{\text{ref}}(y^+ \mid x)} - \beta \log \frac{\pi_\theta(y^- \mid x)}{\pi_{\text{ref}}(y^- \mid x)}\right).$$

This allows the language model $\pi_\theta$ to be trained directly from preference data by maximizing this likelihood — without any separate reward model or RL loop:

$$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x,\, y^+,\, y^-) \sim \mathcal{D}}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y^+ \mid x)}{\pi_{\text{ref}}(y^+ \mid x)} - \beta \log \frac{\pi_\theta(y^- \mid x)}{\pi_{\text{ref}}(y^- \mid x)}\right)\right].$$

DPO is conceptually cleaner and computationally cheaper than PPO-based RLHF, but the two approaches have different empirical trade-offs that remain an active area of research.
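
A sketch of the DPO loss computed from summed per-sequence log-probabilities under the policy and the frozen reference model; the log-probability values below are fabricated, whereas in practice they come from the two language models.

# DPO loss from per-sequence log-probabilities under the policy and the reference.
import torch
import torch.nn.functional as F

def dpo_pair_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    '''-E[log sigmoid(beta * (log-ratio of y+ minus log-ratio of y-))].'''
    logits = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    return -F.logsigmoid(logits).mean()

# Example with fabricated per-sequence log-probabilities (batch of 3 pairs):
logp_pos     = torch.tensor([-12.0, -30.5, -8.1], requires_grad=True)
logp_neg     = torch.tensor([-14.2, -29.0, -9.5], requires_grad=True)
ref_logp_pos = torch.tensor([-12.5, -30.0, -8.4])
ref_logp_neg = torch.tensor([-13.8, -29.4, -9.3])
loss = dpo_pair_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg)
loss.backward()   # gradients flow only into the policy log-probabilities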

Source
# Illustrate the DPO loss as a function of the scaled relative log-ratio
# delta = beta * (log pi(y+)/pi_ref(y+) - log pi(y-)/pi_ref(y-)).
delta = torch.linspace(-5, 5, 200)

dpo_loss = -torch.nn.functional.logsigmoid(delta)  # numerically stable -log(sigmoid(delta))

fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(delta.numpy(), dpo_loss.numpy(), color='tab:blue', lw=2, label='DPO loss')
ax.axvline(0, ls='--', color='gray', lw=1, label='SFT init')
ax.set_xlabel('Scaled relative log-ratio')
ax.set_ylabel('Loss')
ax.set_title('DPO loss: prefers $y^+$ over $y^-$ when ratio > 0')
ax.legend()
ax.set_xlim(-5, 5)
plt.tight_layout()
plt.savefig('dpo_loss.png', dpi=100, bbox_inches='tight')
plt.show()

Connections and Further Topics

Connections to Probabilistic Inference

The planning-as-inference and maximum entropy RL perspectives developed in the previous section connect RL to the message-passing algorithms (Chapter 14), variational inference (Chapter 10), and energy-based models studied elsewhere in this course. The Boltzmann optimal policy $\pi^* \propto \pi_\mathrm{ref} \exp(r/\alpha)$ is an exponential family tilting of the reference, familiar from importance sampling and the exponential family posteriors of Chapter 3.

Reward Hacking and Alignment

A central challenge in RLHF is reward hacking: the language model learns to exploit weaknesses in the learned reward model, producing outputs that score highly according to $r_\phi$ but are not actually preferred by humans. The KL penalty $\beta\, D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})$ mitigates this, and the optimal $\beta$ trades off reward maximization against staying close to the reference. The DPO derivation makes this trade-off explicit.

Group Relative Policy Optimization (GRPO)

Recent work on reasoning models (e.g. DeepSeek-R1) uses GRPO, a variant that estimates the advantage by comparing multiple sampled responses to the same prompt within a batch, rather than using a learned critic. For a prompt $x$, sample $G$ responses $\{y^{(g)}\}_{g=1}^G$ and compute:

$$\hat{A}^{(g)} = \frac{r(x, y^{(g)}) - \mathrm{mean}_g\, r(x, y^{(g)})}{\mathrm{std}_g\, r(x, y^{(g)})}.$$

This avoids training a separate value network while still providing a low-variance baseline for the policy gradient.
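
A sketch of the group-relative advantage computation, with fabricated 0/1 correctness rewards for $G = 5$ sampled responses:

# Group-relative advantages: standardize rewards within the group of responses
# sampled for the same prompt. The reward values are made up for illustration.
import torch

def group_relative_advantage(rewards, eps=1e-6):
    '''A_hat^(g) = (r_g - mean_g r) / (std_g r); `rewards` has shape [G].'''
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = torch.tensor([0.0, 1.0, 1.0, 0.0, 1.0])   # e.g. correctness of 5 sampled solutions
print(group_relative_advantage(rewards))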

Summary

Algorithm    | Key idea                           | Main use case
-------------|------------------------------------|-----------------------------------------------
REINFORCE    | Monte Carlo policy gradient        | Simple environments, language bandit problems
Actor-Critic | TD-error advantage, learned critic | Continuous control
PPO          | Clipped ratio trust region         | Large-scale RL, RLHF
DPO          | Closed-form RL from preferences    | Language model alignment
GRPO         | Group-relative advantage estimate  | Reasoning model training

Conclusion

Reinforcement learning addresses sequential decision-making under uncertainty: an agent interacts with an environment, receives rewards, and must learn a policy that maximizes expected cumulative return. The policy gradient theorem connects the gradient of the expected return to on-policy rollouts, and this insight underlies REINFORCE, actor-critic methods, and the proximal policy optimization (PPO) algorithm used to fine-tune large language models. The planning-as-inference perspective — treating optimal behavior as posterior inference in a graphical model — provides deep connections to variational inference and message passing. Reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) apply these ideas to align language models with human preferences, casting alignment as a KL-regularized reward maximization problem.

References
  1. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
  2. Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3), 229–256.
  3. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv Preprint arXiv:1707.06347.
  4. Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.
  5. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., & others. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744.
  6. Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., & Irving, G. (2019). Fine-tuning language models from human preferences. arXiv Preprint arXiv:1909.08593.
  7. Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., & Finn, C. (2024). Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.