Algorithms for training LLMs

There are two frameworks that are commonly used for (post-)training LLMs with RL:

  • Reinforcement Learning from Human Feedback (RLHF) trains the LLM using RL with rewards derived from a reward model trained on human preferences;

  • Reinforcement Learning with Verifiable Rewards (RLVR) trains the LLM using RL with rewards derived from rule-based or deterministic verifiers.

The main difference between RLHF and RLVR lies in how we assign rewards: RLHF uses a learned reward model, while RLVR uses verifiable (or rules-based) rewards. We first sample a batch of prompts and generate a completion (or multiple completions) for each prompt in the batch using our current policy. A reward is computed for each completion, which can then be used to derive a policy update using our RL optimizer of choice (GRPO, PPO, etc.).

Reinforcement Learning from Human Feedback (RLHF)

RLHF training procedure.

The first form of RL training to be popularized in the LLM domain was RLHF. Early post-ChatGPT LLMs were almost always post-trained using the following three-step alignment procedure (depicted above), as proposed by InstructGPT (Ouyang et al., 2022):

  1. Supervised finetuning (SFT) (or instruction finetuning (IFT)) trains the model using next-token prediction over examples of good completions;
  2. A reward model is trained over a human preference dataset;
  3. Reinforcement learning (RL), usually with PPO, is used to finetune the LLM with the reward model as the reward signal.

The second and third steps of this procedure are collectively referred to as RLHF. This framework actually involves two training procedures: a supervised learning phase for the reward model and an RL training phase for the LLM.

Preference data is the foundation of RLHF. Each element of a preference dataset consists of a prompt, two completions to that prompt, and a preference label (assigned either by a human or by an AI judge such as an LLM) indicating which completion is preferred. In general, specifying an explicit reward for an LLM is very difficult: how do we reliably determine whether a completion is “good” or not when the model has so many diverse capabilities?

A reward model is a specialized LLM – usually a copy of the LLM we are training with an added regression head – that is finetuned to predict a human preference score given a prompt and candidate completion as input. Specifically, the reward model is finetuned on a preference dataset using a ranking loss function that is derived from the Bradley-Terry model; see below.

$$L(\theta) = -\log \left( \sigma \left( \text{RM}_\theta (x, y_w) - \text{RM}_\theta (x, y_l) \right) \right)$$

Put simply, this loss function teaches the reward model to assign a higher score to the preferred response in a preference pair relative to the rejected response.
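This loss can be sketched in a few lines of plain Python, with scalar scores standing in for the reward model’s outputs (the function name is illustrative):

```python
import math

def bradley_terry_loss(score_chosen, score_rejected):
    """Ranking loss -log(sigmoid(RM(x, y_w) - RM(x, y_l))): small when the
    chosen completion already outscores the rejected one."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Correctly separated pair -> small loss; inverted pair -> large loss.
print(round(bradley_terry_loss(2.0, 0.0), 4))  # 0.1269
print(round(bradley_terry_loss(0.0, 2.0), 4))  # 2.1269
```

Note the loss only depends on the score *difference*, which is why reward-model scores are meaningful relatively but not on any absolute scale.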

ChatGPT was extensively aligned via SFT and RLHF, which significantly improved the model’s helpfulness. In this way, RL research – and RLHF in particular – played a pivotal role in creating the impressive and capable LLMs that we have today.

Reinforcement Learning with Verifiable Rewards (RLVR)

RLVR chooses to avoid reward models, instead deriving rewards from automatically verifiable and deterministic sources (e.g., rules or heuristics). Using verifiable rewards instead of neural reward models reduces the risk of noisy rewards and makes extensive, large-scale RL training more feasible by making rewards harder to game.

To train an LLM with RLVR, we must select a domain that is verifiable in nature, e.g., math, coding, gameplay. In other words, we need to create a dataset that has either (i) a known ground truth answer or (ii) some rule-based technique that can be used to verify the correctness of an answer for each prompt in our dataset. This cleanly replaces the reward model in RLHF, avoiding the need to train a reward model and use it during RL.

Beyond substituting a reward model with verifiable rewards, the RL component of RLVR is unchanged. However, RLHF and RLVR differ in their purpose and application:

  • RLHF is usually implemented with PPO as the underlying RL optimizer, while GRPO is the most common RL optimizer for RLVR;

  • RLHF focuses on LLM alignment with preference feedback, while RLVR is used to improve the complex reasoning capabilities of an LLM.

Optimization algorithms

The essential ingredients

The optimization algorithms that we will discuss are a bit complex: viewed in isolation, each combines several pieces (all of them important) into a single objective, which makes it hard to see what is really crucial to the optimization. So let’s first go through the essential ingredients these algorithms share:

Rewards and returns

Rewards are assigned to tokens and convey how “good” the tokens are. The reward assigned to the token generated at step $t$ is denoted by $r_t$.

As is typical in RL, we don’t want to optimize step-level rewards in isolation; instead we want to maximize the cumulative sum of all rewards collected from now until the end of the trajectory. This cumulative sum is called the return, and is denoted by $G_t$. In statistical speak, it is the sum of future discounted rewards:

$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$$

where $\gamma \in [0, 1)$ is the discount factor, which keeps the infinite sum from blowing up.
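A minimal sketch of this computation, done right-to-left so each return reuses the one after it:

```python
def discounted_returns(rewards, gamma=0.9):
    """G_t = sum_k gamma^k * r_{t+k}, computed in one backward pass
    via the recursion G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# A single reward of 1 at the final step is discounted backwards:
print(discounted_returns([0.0, 0.0, 1.0]))  # ~[0.81, 0.9, 1.0]
```

This sparse-terminal-reward pattern is exactly what LLM training looks like: the reward lands on the final token and the return propagates it back to earlier ones.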

KL-penalty

Assume that the reward function has loopholes, i.e., the policy can perform degenerate actions that earn high rewards without actually being good (reward hacking). This is perfectly realistic when the reward function is itself a neural network, as happens in RLHF.

One way to control this is to say that the base model (starting point for RL) won’t degenerate, so the policy should learn to stay within the confines of what the base model determines to be “reasonable.” This is measured as the KL-divergence between the base model and the current policy:

$$\text{KL}(\pi_\theta, \pi_\text{ref}, t) = \log\frac{\pi_\theta(a_t | s_t)}{\pi_\text{ref}(a_t | s_t)}$$

where $\pi_\text{ref}$ is the reference policy (e.g., the SFT model) and $\pi_\theta$ is the current policy. This is subtracted from the per-token reward to form a new reward signal for RL training:

$$r'_t = r_t - \beta \cdot \text{KL}(\pi_\theta, \pi_\text{ref}, t)$$

and the modified return is

$$R_t = \sum_{k=0}^{\infty} \gamma^k r'_{t+k}$$
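A sketch of the shaped reward, assuming we already have per-token log-probabilities from the policy and the reference model (the numbers below are made up):

```python
def kl_shaped_rewards(rewards, logp_policy, logp_ref, beta=0.1):
    """r'_t = r_t - beta * (log pi_theta(a_t|s_t) - log pi_ref(a_t|s_t)).
    The per-token KL estimate is just the log-probability gap."""
    return [r - beta * (lp - lr)
            for r, lp, lr in zip(rewards, logp_policy, logp_ref)]

# A token the policy now favors far more than the reference gets penalized;
# a token where the two models agree keeps its reward untouched.
shaped = kl_shaped_rewards([1.0, 1.0],
                           logp_policy=[-0.1, -2.0],
                           logp_ref=[-2.0, -2.0])
print(shaped)  # first reward shrinks, second stays 1.0
```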

Advantages

If we stop here and use $R_t$ to compute policy gradients, we get the REINFORCE policy update, whose gradient contains the term $\nabla_\theta \log \pi_\theta(a_t | s_t)\ G_t$. The scale of the gradient is on the order of $G_t$, which, as an unnormalized quantity, can cause high variance in the update. E.g., if the returns for two actions are 64 and 100, the gradient update for the second would be significantly larger than for the first.

This high variance problem is resolved by subtracting a baseline from $R_t$ and the overall term is called an advantage:

$$A_t = R_t - b$$

It can be shown that as long as $b$ is not a function of $a_t$, replacing $G_t$ with $A_t$ does not change the expected value of the policy gradient, but it does reduce the variance of the update.

Further, the variance is reduced the most (the optimal baseline) when $b = \mathbb{E}_\theta[R_t]$, i.e., when it is the value function $V(s_t)$, leading to:

$$A_t = R_t - V(s_t)$$

Importance ratios

Practically speaking, we want to sample many trajectories per prompt (a batch) and perform multiple epochs of updates per batch. If this is done as an offline step, the data becomes off-policy as soon as the first update is performed.

To resolve this, we weight the advantages by the importance ratio (generally useful in statistics when sampling from a target distribution is difficult):

$$\rho_t = \frac{\pi_\theta(a_t | s_t)}{\pi_\text{old}(a_t | s_t)}$$

where $\pi_\text{old}$ is the policy before any updates are performed on the batch.

In practice, we replace $A_t$ with $\rho_t A_t$ to get the policy gradient update.

Clipping

This was introduced in PPO and serves to define a “trust” region for policy updates. The idea is that we want to prevent the policy from changing too much in a single update, which can lead to instability.

Note: This is different from the KL penalty, which encourages the policy to stay close to the reference policy over the course of training, but does not explicitly prevent large updates in a single step.

This is done by clipping the importance ratio $\rho_t$ to be within a certain range (e.g., $[1 - \epsilon, 1 + \epsilon]$). Clipping the importance ratio says “for this update don’t move farther than this from the old policy.”

Mathematically,

$$A_t^\text{clip} = \text{clip}(\rho_t, 1 - \epsilon, 1 + \epsilon) A_t$$

PPO

We now describe PPO specifically.

Generalized Advantage Estimation (GAE)

We said the advantage is $A_t = R_t - V(s_t)$, but this glosses over how to estimate $R_t$. There’s a spectrum of choices:

  • Monte Carlo: use the actual observed return $\sum_k \gamma^k r_{t+k}$. This is unbiased but high variance because it depends on the entire trajectory of rewards, which can be noisy.
  • One-step TD: use $r_t + \gamma V(s_{t+1})$. This is low variance but is biased because it relies on the value function’s estimate of the future return, which may be inaccurate.

GAE interpolates between these using a parameter $\lambda \in [0, 1]$. Define the one-step TD error:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

Then the GAE advantage is an exponentially weighted sum of future TD errors:

$$A_t^{\text{GAE}} = \sum_{k=0}^{\infty} (\gamma \lambda)^k \delta_{t+k}$$

At $\lambda = 0$ this reduces to the one-step TD error (low variance, high bias). At $\lambda = 1$ it reduces to the Monte Carlo advantage (high variance, low bias). In practice $\lambda \approx 0.95$ is common — mostly Monte Carlo but with some bootstrapping to tame variance.
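In practice GAE is computed with the backward recursion $A_t = \delta_t + \gamma\lambda A_{t+1}$, which is equivalent to the sum above. A sketch, where `values` carries one extra bootstrap entry $V(s_T)$ (zero if the trajectory terminated):

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Exponentially weighted sum of TD errors via the recursion
    A_t = delta_t + gamma * lam * A_{t+1}."""
    T = len(rewards)
    assert len(values) == T + 1  # includes the bootstrap value V(s_T)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# lam = 0 recovers the one-step TD errors; lam = 1 the Monte Carlo advantage.
print(gae_advantages([1.0, 1.0], [0.5, 0.0, 0.0], gamma=1.0, lam=0.0))  # [0.5, 1.0]
print(gae_advantages([1.0, 1.0], [0.5, 0.0, 0.0], gamma=1.0, lam=1.0))  # [1.5, 1.0]
```

In the second call the result equals the full return minus the value estimate, confirming the $\lambda = 1$ limit.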

Clipping

The clipped surrogate objective is PPO’s signature contribution. Recall the importance-weighted advantage is $\rho_t A_t$. PPO defines:

$$L^{\text{clip}}_t = \min\left(\rho_t A_t,\; \text{clip}(\rho_t, 1-\epsilon, 1+\epsilon)\, A_t\right)$$

The $\min$ is important. Consider the two cases:

  • $\mathbf{A_t > 0}$ (good action): the unclipped term $\rho_t A_t$ grows as $\rho_t$ increases. Clipping caps it at $(1+\epsilon) A_t$, and the $\min$ picks the capped version once $\rho_t > 1+\epsilon$, zeroing out the gradient: no further incentive to increase this action’s probability.
  • $\mathbf{A_t < 0}$ (bad action): if the policy has drifted toward the bad action, so that $\rho_t > 1+\epsilon$, the unclipped term $\rho_t A_t$ is more negative than $(1+\epsilon) A_t$; the $\min$ keeps it, and the gradient still pushes the action’s probability down. Once $\rho_t < 1-\epsilon$, the clipped floor $(1-\epsilon) A_t$ takes over and the gradient vanishes.

The asymmetry is this: the $\min$ makes the objective a pessimistic lower bound on the unclipped surrogate. Gains from pushing the ratio beyond the trust region are ignored, but the penalty for having moved in the wrong direction is never clipped away.
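The two cases can be checked numerically with a small sketch:

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-token surrogate: min(rho * A, clip(rho, 1-eps, 1+eps) * A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Good action, ratio already past 1+eps: the gain is capped at (1+eps)*A.
print(ppo_clip_term(1.5, 1.0))   # 1.2, not 1.5
# Bad action, ratio past 1+eps: the penalty is NOT capped.
print(ppo_clip_term(1.5, -1.0))  # -1.5, not -1.2
```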

The objective

The full PPO-Clip objective combines three terms for maximization:

$$L^{\text{PPO}}(\theta) = \mathbb{E}_t\left[L^{\text{clip}}_t - c_1 \cdot (V_\phi(s_t) - R_t)^2 + c_2 \cdot H[\pi_\theta(\cdot | s_t)]\right]$$

  • $L^{\text{clip}}_t$ is the clipped surrogate (computed with GAE advantages).
  • $(V_\phi - R_t)^2$ is the critic loss, which trains the value function to predict returns.
  • $H[\pi_\theta]$ is an entropy bonus that encourages exploration by rewarding high-entropy action distributions.
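Putting the three terms together over per-token lists (a sketch; the coefficient defaults are illustrative, not canonical values):

```python
def ppo_objective(ratios, advantages, values, returns, entropies,
                  eps=0.2, c1=0.5, c2=0.01):
    """Mean over tokens of L_clip - c1 * critic_loss + c2 * entropy.
    The objective is maximized, so the critic's squared error enters
    with a negative sign."""
    total = 0.0
    for rho, adv, v, ret, ent in zip(ratios, advantages, values,
                                     returns, entropies):
        clipped = max(1.0 - eps, min(1.0 + eps, rho))
        l_clip = min(rho * adv, clipped * adv)
        total += l_clip - c1 * (v - ret) ** 2 + c2 * ent
    return total / len(ratios)

# One token: surrogate 1.0, critic error 0.5, entropy 2.0
print(ppo_objective([1.0], [1.0], [0.5], [1.0], [2.0]))  # ≈ 0.895
```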

The algorithm

The PPO algorithm.

The structure of PPO is outlined above (Schulman et al., 2017). As we can see, each training iteration of PPO performs the following sequence of steps:

  • Sample a diverse batch of prompts.

  • Generate N complete trajectories of T steps for each prompt.

  • Compute advantage estimates for each completion.

  • Perform several policy updates using the PPO surrogate objective over this data.

The $K$ epochs of reuse on the same batch are what make PPO more sample-efficient than REINFORCE. Clipping is what makes this safe.

Complexities

PPO is conceptually simple but practically finicky. Some of the headaches:

  • Four models in memory. You need the policy $\pi_\theta$, the critic, the reference policy $\pi_{\text{ref}}$ (for the KL penalty), and the reward model. For LLMs, these are all large transformers. Memory and throughput become significant engineering problems.
  • Value function training. The critic $V_\phi$ needs its own loss, its own hyperparameters, and its own failure modes. If it’s badly calibrated, your advantages are wrong, and training destabilizes.
  • Hyperparameter sensitivity. $\epsilon$, $\lambda$, $\beta$ (KL coefficient), $c_1$, $c_2$, learning rate, number of epochs $K$, minibatch size, rollout size — all interact. Small changes can make the difference between smooth training and collapse.
  • Distributed training. Rollout and optimization have different compute profiles (generation is memory-bandwidth bound, training is compute bound), which complicates how you split work across GPUs.

PPO works, and it’s still the backbone of most RLHF pipelines, but “works” hides a lot of careful engineering.

GRPO

GRPO (Group Relative Policy Optimization), introduced by DeepSeek, is a simplification of PPO motivated by one observation: the learned value function is the most painful part of PPO, and you can replace it with something much simpler if you’re willing to sample multiple completions per prompt.

The core idea

Instead of learning $V_\phi(s_t)$ to serve as a baseline, GRPO uses the empirical mean reward across a group of completions for the same prompt. For each prompt $x$, sample $G$ completions $y^{(1)}, \dots, y^{(G)}$ from $\pi_{\theta_{\text{old}}}$, score each with the reward model to get $r^{(1)}, \dots, r^{(G)}$, and define the advantage for completion $i$ as:

$$A^{(i)} = \frac{r^{(i)} - \text{mean}(r^{(1)}, \dots, r^{(G)})}{\text{std}(r^{(1)}, \dots, r^{(G)})}$$

Every token in completion $i$ gets the same advantage $A^{(i)}$. There’s no per-token credit assignment, no TD errors, no GAE, no critic.

The intuition: if a completion scored above the group average for its prompt, push the policy toward it; if below, push away. Normalizing by the group’s standard deviation makes the scale consistent across prompts (some prompts are easier, some harder — what matters is relative performance within the group).
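A sketch of the group-relative advantage computation:

```python
import math

def group_advantages(rewards):
    """Group-relative advantages for G completions of one prompt:
    A_i = (r_i - mean(r)) / std(r)."""
    G = len(rewards)
    mean = sum(rewards) / G
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / G)
    return [(r - mean) / std for r in rewards]

# Binary verifiable rewards, G = 4: the two correct completions get +1,
# the two incorrect ones get -1, regardless of the prompt's difficulty.
print(group_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```

Real implementations add a small epsilon to the denominator (omitted here for clarity), since a group where every completion earns the same reward has zero standard deviation and carries no learning signal.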

The objective

GRPO keeps the clipped surrogate and the KL penalty from PPO, but restructures where the KL lives:

$$L^{\text{GRPO}}(\theta) = \frac{1}{G}\sum_{i=1}^{G}\frac{1}{T_i}\sum_{t=1}^{T_i}\left[\min\left(\rho_t^{(i)} A^{(i)},\; \text{clip}(\rho_t^{(i)}, 1-\epsilon, 1+\epsilon)\, A^{(i)}\right) - \beta\, D_{\text{KL}}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)\right]$$

where $\rho_t^{(i)} = \pi_\theta(y_t^{(i)} | \cdot) / \pi_{\theta_{\text{old}}}(y_t^{(i)} | \cdot)$.

Two things to notice compared to PPO:

  • No value loss term. There’s no critic, so nothing to train with a $(V_\phi - R_t)^2$ loss.
  • KL is in the loss, not the reward. PPO folds the KL penalty into the per-token reward (so it flows through advantages). GRPO applies it directly as a separate term in the objective. This is a stylistic difference more than a fundamental one, but it means GRPO’s “advantage” depends only on the reward model, not on KL.

DeepSeek also uses an unbiased KL estimator rather than the naive log-ratio:

$$D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}}) \approx \frac{\pi_{\text{ref}}(y_t|\cdot)}{\pi_\theta(y_t|\cdot)} - \log\frac{\pi_{\text{ref}}(y_t|\cdot)}{\pi_\theta(y_t|\cdot)} - 1$$

which is always non-negative and has lower variance than the naive estimator.
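A sketch of this estimator from per-token log-probabilities:

```python
import math

def kl_k3(logp_theta, logp_ref):
    """Per-token KL estimate r - log(r) - 1 with r = pi_ref / pi_theta,
    computed from log-probabilities of the sampled token."""
    log_r = logp_ref - logp_theta
    return math.exp(log_r) - log_r - 1.0

print(kl_k3(math.log(0.5), math.log(0.5)))  # 0.0: zero when the policies agree
```

Since $x - \log x - 1 \ge 0$ for all $x > 0$, every per-token estimate is non-negative, unlike the naive log-ratio, which can swing either way.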

The algorithm
  1. Rollout: for each prompt in the batch, sample $G$ completions from $\pi_{\theta_{\text{old}}}$ (typical $G = 16$ or $64$).
  2. Score: run the reward model on each completion to get a scalar reward.
  3. Compute group-relative advantages: normalize rewards within each prompt’s group.
  4. Optimize: $K$ epochs of minibatch updates on $L^{\text{GRPO}}$.
  5. Set $\theta_{\text{old}} \leftarrow \theta$ and repeat.

No critic forward pass, no critic backward pass, no critic parameters to store or tune.

Tradeoffs vs PPO

What GRPO gives up:

  • Per-token credit assignment. PPO’s $V_\phi(s_t)$ can say “tokens 1-50 were fine, but things went wrong at token 51.” GRPO gives every token in a completion the same advantage. For long completions with a mix of good and bad segments, this is a coarser signal.
  • Bootstrapping. GAE uses $V_\phi(s_{t+1})$ to reduce variance by bootstrapping from value estimates. GRPO has no bootstrap — it relies entirely on the empirical group statistics.

What GRPO gains:

  • No critic. One fewer model to train, store, and tune. For LLM-scale critics, this is a meaningful fraction of total memory and compute. Further, failures in training the critic no longer affect the policy updates.
  • Simpler hyperparameters. No $\lambda$, no $c_1$, no value function learning rate, no critic architecture choices.
  • Natural fit for verifiable rewards. When the reward is a binary correct/incorrect signal (as in math or code), group-relative normalization gives you a clean “did this completion do better than its peers” signal. This is why GRPO took off with reasoning models — the reward signal is already sequence-level and clean, so per-token credit assignment matters less.

When GRPO works well

Prefer GRPO when:

  • Rewards are sequence-level anyway (e.g., “did the final answer match?”), so you’re not losing information by assigning the same advantage to every token.
  • You can afford many samples per prompt. The quality of the baseline estimate depends on $G$ — too few samples and the group mean is a noisy baseline.
  • You want to avoid the engineering overhead of training a critic.

It’s less clearly better when rewards are genuinely dense and per-token credit assignment matters — though in practice, dense per-token rewards are rare in LLM training.

The DeepSeek R1 results made GRPO popular, but it’s worth remembering that much of what made those results work was the reward design (verifiable correctness on math/code) and scale, not the choice of GRPO over PPO specifically. GRPO is a cleaner algorithm for that setting, not a universal upgrade.

(Bonus) DPO

DPO (Direct Preference Optimization) takes a radically different approach: skip RL entirely. No reward model, no rollouts, no value function, no importance ratios, no clipping. Just a supervised loss on preference pairs.

The core idea

RLHF has two stages: train a reward model on preference data, then run PPO against it. DPO’s insight is that these two stages can be collapsed into one. There’s a closed-form relationship between the optimal policy under a KL-constrained reward maximization objective and the reward function itself:

$$r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \text{const}$$

This says: the optimal policy implicitly defines a reward function via its log-ratio with the reference. So instead of fitting a reward model to preferences and then optimizing a policy against it, you can parameterize the reward in terms of the policy directly and fit the policy to preferences in one step.

The objective

Given a preference dataset of $(x, y_w, y_l)$ triples — prompt, preferred (“chosen”) completion, dispreferred (“rejected”) completion — DPO minimizes:

$$L^{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]$$

where $\sigma$ is the sigmoid. This is just a binary classification loss: increase the policy’s log-ratio over the reference for winners relative to losers. $\beta$ plays a similar role to the KL coefficient in PPO — higher $\beta$ keeps the policy closer to the reference.
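The whole loss fits in a few lines, given summed log-probabilities of each completion under the policy and the reference (a sketch with made-up numbers):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).
    Each input is the summed token log-probability of a completion."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy still equals the reference, both log-ratios are zero,
# so training starts at loss log(2).
print(round(dpo_loss(-5.0, -6.0, -5.0, -6.0), 4))  # 0.6931
```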

Tradeoffs

What DPO gains:

  • No reward model. One fewer model to train and store.
  • No rollouts. Training is pure supervised learning on a fixed dataset. No generation during training, which is usually the most expensive part of RLHF.
  • No PPO machinery. No critics, no GAE, no clipping, no KL scheduling.
  • Stable. It’s just cross-entropy on log-ratios — the failure modes of RL (reward hacking, value function collapse, KL blowup) mostly don’t apply.
  • Learn from any preference data. You can train on preferences collected from any source (different model, human, etc.) without worrying about on-policy distribution shift.

What DPO gives up:

  • Off-policy only. DPO trains on a fixed preference dataset. It can’t explore — it can’t discover that some completion neither annotator saw would have been even better. PPO can, because it generates fresh samples during training.
  • Dataset coverage matters a lot. If your preference pairs don’t cover the behaviors you want to shape, DPO can’t fix that. PPO with a reward model can generalize beyond the specific completions in the preference data.

When to use it

DPO is the right choice when you have a clean preference dataset, limited compute, and want something that works without heavy engineering. It’s become the default for smaller-scale alignment work and open-source fine-tuning. PPO and GRPO remain preferred when you’re pushing for maximum capability, have compute to burn, or have a reward signal (like verifiable correctness) that benefits from on-policy exploration.

References

Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. https://arxiv.org/abs/2203.02155.

Schulman, J., et al. (2017). Proximal policy optimization algorithms. https://arxiv.org/abs/1707.06347.