DeepSeek's high-level software magic

Reinforcement Learning Definitions - Exact Gradient - Deriving the Policy Gradient Objective - Proximal Policy Optimization (PPO) - In Context of Language Models - Group-Relative Policy Optimization (GRPO)

This post assumes that readers are familiar with basic definitions in reinforcement learning, including states, actions, rewards, discount factor, policy, and value function. If not, the first chapter of Sutton and Barto does a great job of covering these.

The Goal: Maximizing Return

The goal of any RL task is to maximize the total (discounted) reward over all future time steps, which is called the return.

Mathematically, we can express this as

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}.

Calculating this directly is intractable (too many time steps, too many possible branches). This motivates us to learn a value function, which estimates the expected future return from a given state.

V_{\pi}(s) = \mathbb{E}[G_t \mid S_t = s, \pi].
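To make these two definitions concrete, here's a minimal Python sketch (illustrative, not from the post) that computes the discounted return of one rollout and uses the average over rollouts as a Monte Carlo estimate of the value function:

```python
# Minimal sketch: discounted return G_t from a list of per-step rewards.
# `rewards` and `gamma` are illustrative names, not from the post.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):   # work backwards: G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

# A Monte Carlo estimate of V(s) is the average return over rollouts from s.
rollouts = [[1.0, 0.0, 2.0], [0.5, 1.5, 0.0]]
v_estimate = sum(discounted_return(rs) for rs in rollouts) / len(rollouts)
```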

The expected return under policy \pi_{\theta}, written in terms of our value function, is

J(\theta) = \sum_s d_{\pi}(s) V_{\pi}(s),

where d_{\pi}(s) is the stationary distribution of states under policy \pi.

Here, \theta represents the parameters that control the policy. In the context of language models, \theta are the weights of the language model, and the policy is the distribution over what the model outputs. With this lens, it makes sense that we perform gradient ascent on \theta to maximize the expected return:

\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta).
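As a toy sketch of what this update rule does (everything here is illustrative; in practice the gradient is estimated from sampled trajectories, as derived next):

```python
# Toy sketch of gradient ascent on a simple objective J(theta) = -(theta - 3)^2,
# whose gradient is -2 * (theta - 3). Purely illustrative.
def grad_J(theta):
    return -2.0 * (theta - 3.0)

theta, alpha = 0.0, 0.1
for _ in range(100):
    theta = theta + alpha * grad_J(theta)   # theta <- theta + alpha * grad J(theta)

print(theta)   # approaches the maximizer theta = 3
```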

Deriving the Policy Gradient

To perform gradient ascent, we need a way to estimate this gradient. We start from the exact gradient g, which is

g = \nabla_{\theta} \mathbb{E}[R(\tau)] = \sum_{\tau} \nabla_{\theta} p_{\theta}(\tau) R(\tau), \quad \text{where } R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t.

We'll show that an equivalent but far more tractable expression for g is

g = \nabla_{\theta} J(\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) R(\tau)\right].
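Before proving this, here's a minimal PyTorch sketch (toy softmax policy, illustrative names) of how the estimator is typically implemented: minimize the surrogate loss -\sum_t \log \pi_{\theta}(a_t \mid s_t) R(\tau), whose gradient is the negative of the expression above.

```python
import torch

# Minimal sketch of the policy gradient (REINFORCE) estimator for a
# toy softmax policy over 3 actions with a 4-dim state. Illustrative only.
torch.manual_seed(0)
policy = torch.nn.Linear(4, 3)                     # logits = policy(state)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

states  = torch.randn(5, 4)                        # one sampled trajectory, 5 steps
actions = torch.randint(0, 3, (5,))                # actions taken at each step
R_tau   = torch.tensor(2.0)                        # return of the whole trajectory

log_probs = torch.log_softmax(policy(states), dim=-1)   # log pi(a|s) for all a
log_pi_a  = log_probs[torch.arange(5), actions]         # log pi(a_t|s_t)

loss = -(log_pi_a.sum() * R_tau)   # minimizing this ascends the policy gradient
optimizer.zero_grad()
loss.backward()
optimizer.step()
```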

Proof:

Let the trajectory probability for \tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots) factorize as

p_{\theta}(\tau) = p(s_0) \prod_{t=0}^{\infty} \pi_{\theta}(a_t \mid s_t) P(s_{t+1} \mid s_t, a_t).

Only the policy term \pi_{\theta}(a_t \mid s_t) depends on \theta, so

\nabla_{\theta} \log p_{\theta}(\tau) = \sum_{t=0}^{\infty} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t).

Using the identity \nabla_{\theta} p = p \nabla_{\theta} \log p, we obtain

g = \sum_{\tau} p_{\theta}(\tau) \nabla_{\theta} \log p_{\theta}(\tau) R(\tau)
= \mathbb{E}\left[\sum_{t=0}^{\infty} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) R(\tau)\right],

which is exactly the expression we set out to show.
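The derivation hinges on the log-derivative identity \nabla_{\theta} p = p \nabla_{\theta} \log p. A quick numerical sanity check (illustrative, not from the post):

```python
import torch

# Sanity check of the identity  grad p = p * grad log p  for a softmax
# distribution over 4 actions. Purely illustrative.
logits = torch.randn(4, requires_grad=True)
probs = torch.softmax(logits, dim=0)

a = 2  # an arbitrary action index
grad_p, = torch.autograd.grad(probs[a], logits, retain_graph=True)
grad_log_p, = torch.autograd.grad(torch.log(probs[a]), logits)

assert torch.allclose(grad_p, probs[a].detach() * grad_log_p)
```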

The Variance Problem

Vanilla policy gradient works in theory. In practice, however, the high variance of this estimator makes for unstable gradient updates. This motivates the idea of only accepting policy updates within a certain acceptable range—a "trust region." This is the key idea behind Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO).

We'll focus on PPO: while TRPO enforces the trust region with an expensive second-order method (Fisher-vector products), PPO uses a simple first-order clipping operation.

Proximal Policy Optimization (PPO)

For PPO, let's first define the policy ratio:

r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}.
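In code, this ratio is usually computed from log-probabilities rather than by dividing raw probabilities, for numerical stability. A small sketch with illustrative tensors:

```python
import torch

# Sketch: the policy ratio r_t(theta) computed from log-probs, which is more
# numerically stable than dividing raw probabilities. Tensors are illustrative.
logp_new = torch.tensor([-1.20, -0.70, -2.30])   # log pi_theta(a_t | s_t)
logp_old = torch.tensor([-1.00, -0.90, -2.30])   # log pi_theta_old(a_t | s_t)

ratio = torch.exp(logp_new - logp_old)           # r_t(theta)
```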

The PPO objective is

J(\theta) = \mathbb{E}\left[\min\left(r_t(\theta) A_t, \ \text{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon) A_t\right)\right].
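Here's a minimal sketch of this objective as it might appear in code, written as a loss to minimize (names and values are illustrative, not a reference implementation):

```python
import torch

# Minimal sketch of the PPO clipped surrogate, written as a loss to minimize.
# `ratio` and `advantages` are illustrative per-token tensors; eps is the clip range.
def ppo_clip_loss(ratio, advantages, eps=0.2):
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()   # negate: we maximize J(theta)

ratio = torch.tensor([0.9, 1.1, 1.6])
advantages = torch.tensor([1.0, -0.5, 2.0])
loss = ppo_clip_loss(ratio, advantages)
```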

Let's break down what this means:

  • If r_t(\theta) > 1: the new policy gives the action higher probability than the old policy
  • If r_t(\theta) < 1: the new policy gives the action lower probability than the old policy
  • If A_t > 0: the action was better than expected (positive advantage)
  • If A_t < 0: the action was worse than expected (negative advantage)

The clipping ensures that once the policy ratio moves outside the trust region (1-\varepsilon, 1+\varepsilon) in the direction the advantage favors, the gradient of J(\theta) becomes 0. This is the 'proximal' part of PPO: we don't allow drastic updates that could be destructive.

PPO Clipping Diagram
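A toy check of this clipping behavior (illustrative): with a positive advantage and a ratio already above 1 + \varepsilon, the objective's gradient with respect to the ratio is zero, so there's no incentive to push the policy further away.

```python
import torch

# Toy check: with a positive advantage and a ratio already above 1 + eps,
# the clipped objective has zero gradient w.r.t. the ratio.
eps = 0.2
ratio = torch.tensor(1.5, requires_grad=True)   # outside (1 - eps, 1 + eps)
advantage = torch.tensor(1.0)                   # positive advantage

objective = torch.min(ratio * advantage,
                      torch.clamp(ratio, 1 - eps, 1 + eps) * advantage)
objective.backward()
print(ratio.grad)   # tensor(0.) -- no incentive to push the ratio further
```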

Connection to Vanilla Policy Gradient:

If we take the gradient of J(\theta) inside the clipping region, we see that it reduces to the vanilla policy gradient:

\nabla_{\theta} J_{\text{PPO}}(\theta) = \mathbb{E}[\nabla_{\theta} r_t(\theta) A_t]
= \mathbb{E}[r_t(\theta) \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) A_t]
= \mathbb{E}[\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) A_t].

In the last step, we can drop the policy ratio because in the on-policy case \theta = \theta_{\mathrm{old}}, making the ratio equal to 1.

Understanding Advantage

The advantage A_t represents how much better or worse an action was compared to the policy's default behavior. Mathematically, it's A_t = Q_{\pi}(s_t, a_t) - V_{\pi}(s_t), where:

  • Q_{\pi}(s,a) is the action-value function: the expected return from taking action a in state s and then following \pi
  • V_{\pi}(s) is the state-value function: the expected return from state s under \pi

In practice, we estimate Q using the actual return R, since \mathbb{E}[R] = Q. For language models, the return of a response is scored by a reward model—a separate network that assigns scores to completed responses.
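Putting those pieces together, here's a small sketch of the advantage computation for a batch of responses, assuming one scalar reward-model score per response and a critic's value estimate (illustrative names; practical implementations typically use per-token values and GAE):

```python
import torch

# Sketch: advantage = (reward-model score, standing in for the return / Q)
#                   - (critic's value estimate, the baseline V).
# All tensors are illustrative.
rm_scores = torch.tensor([0.8, 0.1, 0.5])    # reward model scores per response
values    = torch.tensor([0.4, 0.4, 0.4])    # critic's V(s) estimates
advantages = rm_scores - values              # A = Q - V (positive => better than expected)
```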

In PPO, we compute V(s) with a critic network—extra architecture attached to the base model. For RLHF, this critic requires around 10% more FLOPs and memory. This value function assigns scores to incomplete responses (partial trajectories).

PPO Architecture Diagram
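The critic is often implemented as a small value head on top of the base model's hidden states. A minimal sketch (the hidden size and module names are assumptions, not the post's architecture):

```python
import torch
import torch.nn as nn

# Sketch of a value head: a linear layer mapping each token's hidden state
# to a scalar value estimate. `hidden_size` is an assumed dimension.
class ValueHead(nn.Module):
    def __init__(self, hidden_size=4096):
        super().__init__()
        self.v = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states):              # (batch, seq_len, hidden_size)
        return self.v(hidden_states).squeeze(-1)   # (batch, seq_len) values

value_head = ValueHead()
values = value_head(torch.randn(2, 16, 4096))      # value for each partial response
```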

Group-Relative Policy Optimization (GRPO)

GRPO takes a different approach by removing the critic network entirely. Instead, GRPO:

  1. Generates g responses to the same prompt
  2. Calculates the advantage of each response relative to the average return of the group
  3. Uses this group-relative advantage for policy updates

This trades the memory overhead of a critic network for the compute cost of generating extra responses, while gaining the simplicity of removing an entire model from the architecture.

GRPO Architecture Diagram

The key insight is that we can estimate advantages relative to the group performance rather than needing a separate value function approximator.
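A minimal sketch of that group-relative advantage computation (illustrative; subtracting the group mean is the core idea, and implementations commonly also normalize by the group's standard deviation):

```python
import torch

# Sketch: GRPO-style advantages for g responses to the same prompt.
# Rewards are illustrative; no critic network is involved.
rewards = torch.tensor([1.0, 0.0, 0.5, 2.0])             # one score per response
advantages = rewards - rewards.mean()                     # relative to the group average
# Common variant: also normalize by the group's standard deviation
advantages_norm = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```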

Conclusion

We've traced the path from basic policy gradients to modern techniques like PPO and GRPO. The core challenge remains the same: how to efficiently estimate gradients for policy improvement while maintaining stability. PPO addresses this through trust regions and clipping, while GRPO offers a critic-free alternative that trades compute for architectural simplicity.
