Reinforcement learning (RL) is the study of how an agent learns to make sequential decisions by interacting with an environment in order to maximize cumulative reward (Sutton & Barto, 2018). Unlike supervised learning, there are no labeled targets — the agent receives only scalar feedback signals that may be delayed and sparse. This chapter develops the probabilistic foundations of RL, derives the key policy gradient algorithms, and examines the role of RL in modern language model training.
```python
import torch
import numpy as np
import matplotlib.pyplot as plt
from torch import nn
from torch.distributions import Categorical, Beta
```
## Markov Decision Processes

The standard framework for sequential decision-making is the Markov decision process (MDP), defined by a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:

- $\mathcal{S}$: state space (e.g. positions on a grid, token sequences).
- $\mathcal{A}$: action space (e.g. moves, next tokens).
- $P(s' \mid s, a)$: transition distribution — the probability of reaching state $s'$ from state $s$ after taking action $a$.
- $R(s, a)$: reward function — the immediate signal received after taking action $a$ in state $s$.
- $\gamma \in [0, 1)$: discount factor — how much future rewards are down-weighted relative to immediate ones.

A policy $\pi(a \mid s)$ is a distribution over actions given the current state. Starting from $s_0$, the agent follows the trajectory

$$\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots),$$

where $a_t \sim \pi(\cdot \mid s_t)$, $s_{t+1} \sim P(\cdot \mid s_t, a_t)$, and reward $r_t = R(s_t, a_t)$ is collected at each step. The return from time $t$ is the discounted sum of future rewards:

$$G_t = \sum_{k=0}^{\infty} \gamma^k\, r_{t+k}.$$

The objective is to find a policy that maximizes expected return from the initial state:

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t\, r_t\right],$$

where $\tau$ denotes a trajectory sampled under $\pi$.
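To make the return concrete, here is a small sketch that computes $G_t$ for every step of a hand-made reward sequence by scanning the episode backwards; the reward values and $\gamma$ below are arbitrary numbers chosen only for illustration.

```python
# Sketch: discounted returns G_t = sum_k gamma^k r_{t+k} for a short example episode.
# The rewards and gamma are made-up values for illustration.
gamma = 0.9
rewards = [0.0, 0.0, 1.0, 0.0, 2.0]   # r_0, ..., r_4

def discounted_returns(rewards, gamma):
    '''Compute G_t for every t by scanning the episode backwards.'''
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G             # G_t = r_t + gamma * G_{t+1}
        returns.append(G)
    return list(reversed(returns))

print(discounted_returns(rewards, gamma))  # G_0 = 0.9**2 * 1 + 0.9**4 * 2 ≈ 2.12
```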
## Value Functions and the Bellman Equations

Two quantities are central to RL algorithms.

The state-value function measures the expected return when starting in state $s$ and following $\pi$ thereafter:

$$V^\pi(s) = \mathbb{E}_\pi\left[G_t \mid s_t = s\right].$$

The action-value function conditions on both state and action:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid s_t = s,\, a_t = a\right].$$

Both satisfy Bellman equations — recursive self-consistency conditions. For the optimal policy $\pi^*$, these become the Bellman optimality equations:

$$V^*(s) = \max_a \left[R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s')\right], \qquad Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, \max_{a'} Q^*(s', a').$$

The advantage function measures how much better action $a$ is than the average under $\pi$:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s).$$

When $A^\pi(s, a) > 0$, action $a$ is better than average in state $s$ under $\pi$.
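The optimality backup can be turned directly into an algorithm (value iteration). The sketch below runs it on a tiny tabular MDP whose transition and reward tables are invented purely for illustration; the last line recovers the advantage of each action under the resulting values.

```python
# Sketch: value iteration = repeated Bellman optimality backups on a made-up 3-state MDP.
n_states, n_actions, gamma = 3, 2, 0.9
P = np.array([[[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],
              [[0.0, 0.5, 0.5], [0.0, 0.1, 0.9]],
              [[1.0, 0.0, 0.0], [0.3, 0.0, 0.7]]])   # P[s, a, s'] (rows sum to 1)
R = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [0.5, 2.0]])                           # R[s, a]

V = np.zeros(n_states)
for _ in range(200):
    Q = R + gamma * P @ V      # Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
    V = Q.max(axis=1)          # Bellman optimality backup: V(s) = max_a Q(s, a)

advantage = Q - V[:, None]     # A(s, a) = Q(s, a) - V(s); the greedy action has A = 0
print(V.round(3), advantage.round(3))
```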
## Policy Gradient Methods

Policy gradient methods directly parameterize the policy $\pi_\theta(a \mid s)$ and optimize $J(\theta)$ by gradient ascent. The key result is the policy gradient theorem (Sutton & Barto, 2018):

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right].$$

This has an elegant derivation. The probability of a trajectory under $\pi_\theta$ is

$$p_\theta(\tau) = p(s_0) \prod_{t} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t),$$

and since the transition probabilities do not depend on $\theta$,

$$\nabla_\theta \log p_\theta(\tau) = \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t).$$

Applying the log-derivative trick, $\nabla_\theta \mathbb{E}_{p_\theta}[f(\tau)] = \mathbb{E}_{p_\theta}\left[f(\tau)\, \nabla_\theta \log p_\theta(\tau)\right]$, then yields the theorem.
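The theorem is easy to check numerically in a one-step setting. With a softmax policy over three actions and fixed per-action rewards (arbitrary values below), the score-function estimate of the gradient should match the exact gradient of the expected reward.

```python
# Sketch: numerical check of the log-derivative trick on a one-step problem.
torch.manual_seed(0)
r = torch.tensor([1.0, 2.0, 3.0])                 # arbitrary per-action rewards
theta = torch.randn(3, requires_grad=True)        # softmax policy logits

# Exact gradient of J(theta) = sum_a pi_theta(a) r(a)
J = (torch.softmax(theta, dim=0) * r).sum()
exact_grad, = torch.autograd.grad(J, theta)

# Score-function (REINFORCE-style) estimate of the same gradient
dist = Categorical(logits=theta)
a = dist.sample((50_000,))
est_grad, = torch.autograd.grad((dist.log_prob(a) * r[a]).mean(), theta)
print(exact_grad, est_grad)                       # the two should agree closely
```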
### REINFORCE

The simplest policy gradient algorithm is REINFORCE (Williams, 1992). Given a batch of $N$ trajectories sampled from $\pi_\theta$, the gradient estimate is:

$$\hat{g} = \frac{1}{N} \sum_{i=1}^{N} \sum_{t} \nabla_\theta \log \pi_\theta\left(a_t^{(i)} \mid s_t^{(i)}\right) G_t^{(i)}.$$

This estimator is unbiased but can have high variance. A common variance-reduction technique subtracts a baseline $b(s_t)$ — typically an estimate of $V^\pi(s_t)$ — from the return:

$$\hat{g} = \frac{1}{N} \sum_{i=1}^{N} \sum_{t} \nabla_\theta \log \pi_\theta\left(a_t^{(i)} \mid s_t^{(i)}\right) \left(G_t^{(i)} - b(s_t^{(i)})\right).$$

Subtracting the baseline does not introduce bias (the baseline term has zero expectation under the policy) but can dramatically reduce variance.
```python
# Multi-armed bandit: compare REINFORCE with Thompson sampling
# K-armed Bernoulli bandit
torch.manual_seed(42)
np.random.seed(42)

K = 5                                                  # number of arms
true_probs = torch.tensor([0.1, 0.3, 0.5, 0.7, 0.9])   # true reward probs
T = 500                                                # time horizon

# ── Thompson sampling (Bayesian baseline) ──────────────────────────────────────
def thompson_sampling(K, true_probs, T):
    '''Beta-Bernoulli Thompson sampling on a K-armed bandit.'''
    alpha = torch.ones(K)  # Beta(alpha, beta) prior
    beta = torch.ones(K)
    rewards = []
    for _ in range(T):
        # Sample from posterior
        theta = torch.distributions.Beta(alpha, beta).sample()
        arm = theta.argmax().item()
        reward = (torch.rand(1).item() < true_probs[arm].item())
        rewards.append(float(reward))
        if reward:
            alpha[arm] += 1
        else:
            beta[arm] += 1
    return rewards

# ── REINFORCE on bandit (stateless MDP, one step per episode) ──────────────────
def reinforce_bandit(K, true_probs, T, lr=0.05):
    '''REINFORCE for K-armed bandit: softmax policy, baseline = running mean.'''
    logits = torch.zeros(K, requires_grad=True)
    optimizer = torch.optim.Adam([logits], lr=lr)
    rewards = []
    baseline = 0.0
    alpha_ema = 0.05  # EMA coefficient for baseline
    for t in range(T):
        dist = Categorical(logits=logits)
        arm = dist.sample()
        reward = float(torch.rand(1).item() < true_probs[arm].item())
        rewards.append(reward)
        baseline = (1 - alpha_ema) * baseline + alpha_ema * reward
        loss = -dist.log_prob(arm) * (reward - baseline)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return rewards

def running_mean(x, w=20):
    return np.convolve(x, np.ones(w)/w, mode='valid')

ts_rewards = thompson_sampling(K, true_probs, T)
rf_rewards = reinforce_bandit(K, true_probs, T)

fig, axes = plt.subplots(1, 2, figsize=(11, 4))

# Left: cumulative regret
optimal = true_probs.max().item()
ts_regret = np.cumsum(optimal - np.array(ts_rewards))
rf_regret = np.cumsum(optimal - np.array(rf_rewards))
axes[0].plot(ts_regret, label='Thompson sampling', color='tab:blue')
axes[0].plot(rf_regret, label='REINFORCE', color='tab:orange')
axes[0].set_xlabel('Step'); axes[0].set_ylabel('Cumulative regret')
axes[0].set_title('Cumulative regret on Bernoulli bandit')
axes[0].legend()

# Right: smoothed per-step reward
axes[1].plot(running_mean(ts_rewards), label='Thompson sampling', color='tab:blue')
axes[1].plot(running_mean(rf_rewards), label='REINFORCE', color='tab:orange')
axes[1].axhline(optimal, ls='--', color='k', lw=1, label='Optimal')
axes[1].set_xlabel('Step'); axes[1].set_ylabel('Reward (smoothed)')
axes[1].set_title('Per-step reward (20-step moving average)')
axes[1].legend()

plt.tight_layout()
plt.savefig('bandit_rl.png', dpi=100, bbox_inches='tight')
plt.show()
```

## Actor-Critic Methods and PPO

REINFORCE uses the full Monte Carlo return $G_t$, which is unbiased but high-variance. Actor-critic methods replace $G_t$ with a lower-variance estimate by maintaining a learned critic $V_\phi(s)$ alongside the actor $\pi_\theta(a \mid s)$. The actor is updated using the advantage estimate

$$\hat{A}_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t),$$

which is an empirical version of the Bellman residual (also called the TD error). The critic is updated to minimize the squared TD error $\hat{A}_t^2$.
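A minimal sketch of a single actor-critic update, using toy placeholder networks and one made-up transition; in practice the losses are averaged over a batch and each network has its own optimizer.

```python
# Sketch of one actor-critic update on a single transition (s, a, r, s').
n_state, n_action, gamma = 4, 3, 0.99
actor = nn.Sequential(nn.Linear(n_state, 32), nn.Tanh(), nn.Linear(32, n_action))
critic = nn.Sequential(nn.Linear(n_state, 32), nn.Tanh(), nn.Linear(32, 1))

s, s_next = torch.randn(n_state), torch.randn(n_state)   # placeholder states
a, r = torch.tensor(1), torch.tensor(0.5)                # placeholder action and reward

# TD-error advantage: A_hat = r + gamma * V(s') - V(s), treated as a constant for the actor
with torch.no_grad():
    advantage = r + gamma * critic(s_next) - critic(s)

actor_loss = -Categorical(logits=actor(s)).log_prob(a) * advantage       # policy gradient term
critic_loss = (r + gamma * critic(s_next).detach() - critic(s)).pow(2)   # squared TD error
(actor_loss + critic_loss).sum().backward()   # then step each network's optimizer
```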
### Proximal Policy Optimization (PPO)
A key practical challenge in policy gradient methods is choosing a step size: too large a step can collapse the policy, and the resulting poor samples make recovery slow. Trust region methods address this by constraining each update to stay close to the current policy.
PPO (Schulman et al., 2017) enforces a soft trust region via a clipped surrogate objective. Define the probability ratio

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}.$$

The PPO objective clips this ratio to prevent large updates:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\, \hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right) \hat{A}_t\right)\right],$$

where $\epsilon$ is a hyperparameter. When $\hat{A}_t > 0$ the clip prevents the policy from increasing $r_t(\theta)$ beyond $1 + \epsilon$; when $\hat{A}_t < 0$ it prevents a decrease below $1 - \epsilon$. PPO is simple to implement, stable, and is the dominant algorithm in large-scale RL applications.
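Below is a sketch of the clipped surrogate as a loss function, assuming the per-timestep log-probabilities under the new and old policies and the advantage estimates have already been computed (the tensors here are random placeholders).

```python
# Sketch of the PPO clipped surrogate loss (negated, so it can be minimized).
def ppo_clip_loss(logp_new, logp_old, advantages, epsilon=0.2):
    ratio = torch.exp(logp_new - logp_old)                           # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()                     # pessimistic bound

# Placeholder values for a batch of 8 timesteps
logp_old = torch.randn(8)
logp_new = logp_old + 0.1 * torch.randn(8)
advantages = torch.randn(8)
print(ppo_clip_loss(logp_new, logp_old, advantages))
```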
## Planning as Inference and Maximum Entropy RL

### Control as Inference
The probabilistic graphical models we have studied throughout this course suggest a natural reframing of RL: rather than treating reward maximization as an optimization problem, we can treat it as posterior inference in a suitably defined generative model.
Introduce a sequence of binary optimality variables $\mathcal{O}_t \in \{0, 1\}$ with the likelihood:

$$p(\mathcal{O}_t = 1 \mid s_t, a_t) = \exp\left(R(s_t, a_t)/\alpha\right),$$

where $\alpha > 0$ is a temperature parameter (requiring $R(s_t, a_t) \le 0$, or equivalently working with exponentiated rewards after a suitable shift). The full generative model over trajectories and optimality variables is:

$$p(\tau, \mathcal{O}_{1:T}) = p(s_0) \prod_{t} p(a_t)\, P(s_{t+1} \mid s_t, a_t)\, p(\mathcal{O}_t \mid s_t, a_t),$$

with a (typically uniform) prior $p(a_t)$ over actions. The RL objective — finding a policy that maximizes expected reward — is equivalent to computing the posterior distribution over trajectories given that all optimality variables equal 1:

$$p(\tau \mid \mathcal{O}_{1:T} = 1) \propto p(\tau)\, \exp\left(\sum_t R(s_t, a_t)/\alpha\right).$$

This is analogous to the smoothing distribution in a hidden Markov model, computed by a backward message-passing algorithm. Define the backward messages:

$$\beta_t(s_t, a_t) = p(\mathcal{O}_{t:T} = 1 \mid s_t, a_t), \qquad \beta_t(s_t) = p(\mathcal{O}_{t:T} = 1 \mid s_t).$$

Taking logarithms and defining the soft value function $V_{\mathrm{soft}}(s_t) = \alpha \log \beta_t(s_t)$ and soft Q-function $Q_{\mathrm{soft}}(s_t, a_t) = \alpha \log \beta_t(s_t, a_t)$, the backward recursion becomes the soft Bellman equations:

$$Q_{\mathrm{soft}}(s_t, a_t) = R(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim P(\cdot \mid s_t, a_t)}\left[V_{\mathrm{soft}}(s_{t+1})\right], \qquad V_{\mathrm{soft}}(s_t) = \alpha \log \sum_{a} \exp\left(Q_{\mathrm{soft}}(s_t, a)/\alpha\right).$$

The second equation replaces the hard $\max_a$ of standard Bellman optimality with a soft-max (log-sum-exp), recovering the hard max in the limit $\alpha \to 0$.
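The soft backup is easy to run on a small tabular example. The sketch below reuses the same kind of made-up MDP as earlier, replacing the hard max with the $\alpha$-weighted log-sum-exp; the resulting values approach the hard-max values as $\alpha \to 0$.

```python
# Sketch: soft value iteration on a made-up 3-state, 2-action MDP.
P = torch.tensor([[[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],
                  [[0.0, 0.5, 0.5], [0.0, 0.1, 0.9]],
                  [[1.0, 0.0, 0.0], [0.3, 0.0, 0.7]]])   # P[s, a, s']
R = torch.tensor([[0.0, 0.0], [0.0, 1.0], [0.5, 2.0]])   # R[s, a]
gamma = 0.9

def soft_value_iteration(alpha, iters=300):
    V = torch.zeros(3)
    for _ in range(iters):
        Q = R + gamma * (P @ V)                        # Q_soft(s, a) = R + gamma * E[V_soft(s')]
        V = alpha * torch.logsumexp(Q / alpha, dim=1)  # V_soft(s) = alpha * log sum_a exp(Q/alpha)
    return V

for alpha in [1.0, 0.1, 0.01]:
    print(alpha, soft_value_iteration(alpha))          # approaches the hard-max values as alpha -> 0
```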
### Maximum Entropy RL

The inference perspective has an equivalent variational characterization. Define the maximum entropy RL objective:

$$J_{\mathrm{MaxEnt}}(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_t \gamma^t \left(R(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\right)\right],$$

where $\mathcal{H}\big(\pi(\cdot \mid s)\big) = -\mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[\log \pi(a \mid s)\right]$ is the policy entropy. The entropy bonus simultaneously:

- Encourages exploration: the policy is penalized for being overly deterministic.
- Improves robustness: the policy learns to hedge against uncertainty.
- Avoids premature commitment: multiple near-optimal strategies are maintained.

The unique optimal policy for $J_{\mathrm{MaxEnt}}$ is the Boltzmann / softmax policy:

$$\pi^*(a \mid s) = \frac{\exp\left(Q_{\mathrm{soft}}(s, a)/\alpha\right)}{\sum_{a'} \exp\left(Q_{\mathrm{soft}}(s, a')/\alpha\right)} = \exp\left(\big(Q_{\mathrm{soft}}(s, a) - V_{\mathrm{soft}}(s)\big)/\alpha\right).$$

This is a Gibbs distribution over actions, with the (negative) soft Q-function playing the role of an energy. As $\alpha \to 0$ the policy concentrates on the greedy action; as $\alpha \to \infty$ it approaches the uniform distribution.
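The two limits are easy to see numerically with made-up soft Q-values for a single state:

```python
# Sketch: Boltzmann policy softmax(Q_soft / alpha) at several temperatures (toy Q values).
Q_soft = torch.tensor([1.0, 1.5, 0.5])
for alpha in [0.05, 0.5, 5.0, 50.0]:
    pi = torch.softmax(Q_soft / alpha, dim=0)
    print(f"alpha={alpha:5.2f}  pi={pi.numpy().round(3)}")   # greedy as alpha -> 0, uniform as alpha -> inf
```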
### Connection to KL Regularization

The MaxEnt objective can be written equivalently as a KL minimization. For a reference policy $\pi_{\mathrm{ref}}$, the KL-regularized objective

$$J_{\mathrm{KL}}(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_t R(s_t, a_t)\right] - \alpha\, \mathbb{E}_{s_t \sim \pi}\left[\mathrm{KL}\big(\pi(\cdot \mid s_t)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid s_t)\big)\right]$$

has the same Boltzmann-form optimal solution, with $\pi_{\mathrm{ref}}$ in place of the uniform distribution:

$$\pi^*(a \mid s) \propto \pi_{\mathrm{ref}}(a \mid s)\, \exp\left(Q_{\mathrm{soft}}(s, a)/\alpha\right).$$

This is precisely the RLHF objective studied in the next section. The KL penalty to the reference (SFT) language model is the MaxEnt RL entropy bonus, with the entropy measured relative to $\pi_{\mathrm{ref}}$ rather than the uniform distribution. The closed-form optimal RLHF policy derived there is the Boltzmann policy of this MaxEnt problem.
### Soft Actor-Critic

In practice, Soft Actor-Critic (SAC) implements MaxEnt RL for continuous action spaces by maintaining:

- An actor $\pi_\theta$ trained to minimize $\mathrm{KL}\big(\pi_\theta(\cdot \mid s)\,\|\,\exp(Q_\phi(s, \cdot)/\alpha)/Z(s)\big)$, which simplifies to maximizing $\mathbb{E}_{a \sim \pi_\theta}\left[Q_\phi(s, a) - \alpha \log \pi_\theta(a \mid s)\right]$ (sketched after this list).
- A critic $Q_\phi$ trained on the soft Bellman residual, using a target network for stability.
- An optional automatic entropy tuning step that adjusts $\alpha$ to maintain a target entropy $\bar{\mathcal{H}}$.

SAC is the dominant off-policy MaxEnt RL algorithm for continuous control, achieving strong sample efficiency through experience replay and the stabilizing effect of the entropy bonus.
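As a rough illustration of the actor update, the sketch below computes the SAC actor loss for a single state with toy placeholder networks and a plain diagonal Gaussian policy; the tanh squashing, twin critics, target networks, and replay buffer of the full algorithm are omitted.

```python
# Sketch of the SAC actor loss on one state: maximize E[ Q(s, a) - alpha * log pi(a|s) ].
state_dim, action_dim, alpha = 4, 2, 0.2
policy = nn.Linear(state_dim, 2 * action_dim)    # toy head producing mean and log-std
q_net = nn.Linear(state_dim + action_dim, 1)     # toy soft Q critic

s = torch.randn(state_dim)                       # placeholder state
mean, log_std = policy(s).chunk(2)
dist = torch.distributions.Normal(mean, log_std.exp())
a = dist.rsample()                               # reparameterized sample, so gradients reach the actor
log_prob = dist.log_prob(a).sum()                # log pi(a | s) for the diagonal Gaussian

actor_loss = (alpha * log_prob - q_net(torch.cat([s, a]))).mean()
actor_loss.backward()                            # only the actor's parameters would be stepped
```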
## Reinforcement Learning from Human Feedback

### The Language Model as a Policy

A language model $\pi_\theta(y_t \mid x, y_{<t})$ can be viewed as a policy in an MDP where:

- the state is the prompt $x$ (and the tokens $y_{<t}$ generated so far),
- the action is the next token $y_t$ (drawn from the vocabulary),
- the episode ends at an end-of-sequence token, yielding a full response $y$.

The transition dynamics are deterministic (the generated token becomes part of the state), so the stochasticity in the MDP comes entirely from $\pi_\theta$.
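A small sketch of this view: the log-probability of a response is the sum of per-token log-probabilities under the policy. The "language model" below is just a random linear head over a ten-token vocabulary, standing in for a real model.

```python
# Sketch: log pi_theta(y | x) = sum_t log pi_theta(y_t | x, y_<t), with a toy stand-in LM.
vocab_size, hidden = 10, 8
lm_head = nn.Linear(hidden, vocab_size)          # placeholder for a real language model head
hidden_states = torch.randn(5, hidden)           # one hidden state per generated position
response_tokens = torch.tensor([3, 1, 7, 2, 9])  # y_1, ..., y_5 (arbitrary token ids)

logits = lm_head(hidden_states)                                      # (T, vocab)
token_logps = Categorical(logits=logits).log_prob(response_tokens)   # log pi(y_t | x, y_<t)
print(token_logps, token_logps.sum())                                # per-token and sequence log-prob
```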
### Reward Modeling from Human Preferences

Many desirable properties of language model outputs — helpfulness, harmlessness, honesty — are difficult to specify as an explicit reward function. RLHF (Christiano et al., 2017; Ouyang et al., 2022) sidesteps this by learning a reward model from human preference data.

**Step 1: Collect preference data.** Present human annotators with pairs of responses $(y^+, y^-)$ to the same prompt $x$, and ask which is preferred. The Bradley–Terry model for pairwise preferences gives:

$$p(y^+ \succ y^- \mid x) = \sigma\big(r_\phi(x, y^+) - r_\phi(x, y^-)\big),$$

where $r_\phi(x, y)$ is a learned reward model and $\sigma$ is the sigmoid. The reward model is trained by maximum likelihood on the preference dataset $\mathcal{D} = \{(x, y^+, y^-)\}$:

$$\mathcal{L}(\phi) = -\mathbb{E}_{(x, y^+, y^-) \sim \mathcal{D}}\left[\log \sigma\big(r_\phi(x, y^+) - r_\phi(x, y^-)\big)\right].$$
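Below is a sketch of this maximum-likelihood objective, assuming the reward-model scores for the preferred and rejected responses have already been computed (the scores are random placeholders).

```python
# Sketch of the Bradley-Terry reward-model loss from paired scores.
def reward_model_loss(r_chosen, r_rejected):
    '''Negative log-likelihood of preferences under the Bradley-Terry model.'''
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

r_chosen = torch.randn(4)     # r_phi(x, y+) for a batch of 4 comparisons (placeholders)
r_rejected = torch.randn(4)   # r_phi(x, y-)
print(reward_model_loss(r_chosen, r_rejected))
```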
**Step 2: Fine-tune with RL.** Using $r_\phi$ as the reward signal, fine-tune the language model with PPO to maximize expected reward (Ziegler et al., 2019). To prevent the policy from straying too far from the supervised fine-tuned (SFT) reference model $\pi_{\mathrm{ref}}$, a KL penalty is added:

$$J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\left[r_\phi(x, y)\right] - \beta\, \mathbb{E}_{x \sim \mathcal{D}}\left[\mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)\right],$$

where $\beta$ controls the strength of the KL penalty. The KL term serves two purposes: it prevents reward hacking (exploiting the reward model), and it ensures the language model retains its general capabilities.

The KL-penalized objective has a closed-form optimal solution. Let $Z(x) = \sum_y \pi_{\mathrm{ref}}(y \mid x) \exp\left(r_\phi(x, y)/\beta\right)$ be the partition function. Then:

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\left(r_\phi(x, y)/\beta\right).$$

This is a tilted or exponentially weighted version of the reference policy — a familiar form from the importance sampling and energy-based model literatures.
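The tilting is easy to visualize on a toy example: with a made-up reference distribution over five candidate responses and made-up reward scores, re-weighting by $\exp(r_\phi/\beta)$ and renormalizing interpolates between the reference (large $\beta$) and a point mass on the highest-reward response (small $\beta$).

```python
# Sketch: the KL-regularized optimum as a tilted categorical (all numbers are made up).
pi_ref = torch.tensor([0.4, 0.3, 0.15, 0.1, 0.05])   # reference policy over 5 responses
rewards = torch.tensor([0.0, 1.0, 2.0, 0.5, 3.0])    # reward-model scores r_phi(x, y)

for beta in [10.0, 1.0, 0.1]:
    tilted = pi_ref * torch.exp(rewards / beta)
    pi_star = tilted / tilted.sum()                   # divide by the partition function Z(x)
    print(f"beta={beta:5.1f}  pi*={pi_star.numpy().round(3)}")
```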
### Direct Preference Optimization

Direct Preference Optimization (DPO) (Rafailov et al., 2024) observes that, given the closed-form optimal policy above, the reward model can be expressed directly in terms of the language model:

$$r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x).$$

Substituting into the Bradley–Terry preference model and noting that $\beta \log Z(x)$ cancels in the difference $r(x, y^+) - r(x, y^-)$, the preference probability becomes:

$$p(y^+ \succ y^- \mid x) = \sigma\left(\beta \log \frac{\pi_\theta(y^+ \mid x)}{\pi_{\mathrm{ref}}(y^+ \mid x)} - \beta \log \frac{\pi_\theta(y^- \mid x)}{\pi_{\mathrm{ref}}(y^- \mid x)}\right).$$

This allows the language model to be trained directly from preference data by maximizing this likelihood — without any separate reward model or RL loop:

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\mathbb{E}_{(x, y^+, y^-) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y^+ \mid x)}{\pi_{\mathrm{ref}}(y^+ \mid x)} - \beta \log \frac{\pi_\theta(y^- \mid x)}{\pi_{\mathrm{ref}}(y^- \mid x)}\right)\right].$$
DPO is conceptually cleaner and computationally cheaper than PPO-based RLHF, but the two approaches have different empirical trade-offs that remain an active area of research.
```python
# Illustrate the DPO loss as a function of the relative log-ratio.
beta_val = 0.1
delta = torch.linspace(-5, 5, 200)  # beta * (log pi(y+)/pi_ref(y+) - log pi(y-)/pi_ref(y-))
dpo_loss = -torch.log(torch.sigmoid(delta))

fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(delta.numpy(), dpo_loss.numpy(), color='tab:blue', lw=2, label='DPO loss')
ax.axvline(0, ls='--', color='gray', lw=1, label='SFT init')
ax.set_xlabel('Scaled relative log-ratio')
ax.set_ylabel('Loss')
ax.set_title('DPO loss: prefers $y^+$ over $y^-$ when ratio > 0')
ax.legend()
ax.set_xlim(-5, 5)
plt.tight_layout()
plt.savefig('dpo_loss.png', dpi=100, bbox_inches='tight')
plt.show()
```
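The plotted loss is a function of the scaled difference of log-ratios; in training, that difference is computed from sequence log-probabilities under the current policy and a frozen reference model. A minimal sketch, with random placeholder log-probabilities:

```python
# Sketch of the DPO loss computed from policy and reference sequence log-probabilities.
def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    chosen_ratio = logp_chosen - ref_logp_chosen        # log pi(y+|x) - log pi_ref(y+|x)
    rejected_ratio = logp_rejected - ref_logp_rejected  # log pi(y-|x) - log pi_ref(y-|x)
    return -torch.nn.functional.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Batch of 4 preference pairs with placeholder log-probabilities
print(dpo_loss(torch.randn(4), torch.randn(4),
               ref_logp_chosen=torch.randn(4), ref_logp_rejected=torch.randn(4)))
```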

## Connections and Further Topics

### Connections to Probabilistic Inference
The planning-as-inference and maximum entropy RL perspectives developed in the previous section connect RL to the message-passing algorithms (Chapter 14), variational inference (Chapter 10), and energy-based models studied elsewhere in this course. The Boltzmann optimal policy is an exponential family tilting of the reference, familiar from importance sampling and the exponential family posteriors of Chapter 3.
### Reward Hacking and Alignment

A central challenge in RLHF is reward hacking: the language model learns to exploit weaknesses in the learned reward model, producing outputs that score highly according to $r_\phi$ but are not actually preferred by humans. The KL penalty mitigates this, and the optimal policy trades off reward maximization against staying close to the reference. The DPO derivation makes this trade-off explicit.
### Group Relative Policy Optimization (GRPO)

Recent work on reasoning models (e.g. DeepSeek-R1) uses GRPO, a variant that estimates the advantage by comparing multiple sampled responses to the same prompt within a batch, rather than using a learned critic. For a prompt $x$, sample $G$ responses $y_1, \dots, y_G \sim \pi_\theta(\cdot \mid x)$ and compute:

$$\hat{A}_i = \frac{r(x, y_i) - \operatorname{mean}\big(r(x, y_1), \dots, r(x, y_G)\big)}{\operatorname{std}\big(r(x, y_1), \dots, r(x, y_G)\big)}.$$

This avoids training a separate value network while still providing a low-variance baseline for the policy gradient.
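Below is a sketch of the group-relative advantage computation, assuming a reward score has already been assigned to each of the $G$ sampled responses (the scores are placeholders).

```python
# Sketch: standardize rewards within each prompt's group of G samples.
def group_relative_advantages(rewards_per_prompt):
    '''rewards_per_prompt: (num_prompts, G) reward scores for G samples per prompt.'''
    mean = rewards_per_prompt.mean(dim=1, keepdim=True)
    std = rewards_per_prompt.std(dim=1, keepdim=True)
    return (rewards_per_prompt - mean) / (std + 1e-8)   # A_hat_i for each sampled response

rewards = torch.tensor([[0.1, 0.9, 0.4, 0.6],            # prompt 1: G = 4 sampled responses
                        [0.0, 0.0, 1.0, 0.5]])           # prompt 2
print(group_relative_advantages(rewards))
```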
## Summary
| Algorithm | Key idea | Main use case |
|---|---|---|
| REINFORCE | Monte Carlo policy gradient | Simple environments, language bandit problems |
| Actor-Critic | TD-error advantage, learned critic | Continuous control |
| PPO | Clipped ratio trust region | Large-scale RL, RLHF |
| DPO | Closed-form RL from preferences | Language model alignment |
| GRPO | Group-relative advantage estimate | Reasoning model training |
## Conclusion
Reinforcement learning addresses sequential decision-making under uncertainty: an agent interacts with an environment, receives rewards, and must learn a policy that maximizes expected cumulative return. The policy gradient theorem connects the gradient of the expected return to on-policy rollouts, and this insight underlies REINFORCE, actor-critic methods, and the proximal policy optimization (PPO) algorithm used to fine-tune large language models. The planning-as-inference perspective — treating optimal behavior as posterior inference in a graphical model — provides deep connections to variational inference and message passing. Reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) apply these ideas to align language models with human preferences, casting alignment as a KL-regularized reward maximization problem.
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
- Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3), 229–256.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv Preprint arXiv:1707.06347.
- Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.
- Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., & others. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744.
- Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., & Irving, G. (2019). Fine-tuning language models from human preferences. arXiv Preprint arXiv:1909.08593.
- Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., & Finn, C. (2024). Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.