DeepSeek's high-level software magic

Reinforcement Learning Definitions - Exact Gradient - Deriving the Policy Gradient Objective - Proximal Policy Optimization (PPO) - In Context of Language Models - Group-Relative Policy Optimization (GRPO)

This post assumes that readers are familiar with basic definitions in reinforcement learning, including states, actions, rewards, discount factor, policy, and value function. If not, the first chapter of Sutton and Barto does a great job of covering these.

The Goal: Maximizing Return

The goal of any RL task is to maximize the total (discounted) reward over all future time steps, which is called the return.

Mathematically, we can express this as

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}.

Calculating this directly is intractable (too many time steps, too many possible branches). This motivates us to learn a value function, which estimates the expected future return from a given state.

V_{\pi}(s) = \mathbb{E}[G_t \mid S_t = s, \pi].
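To make these two definitions concrete, here's a minimal Python sketch (illustrative, not from the post) that computes the discounted return of one rollout and uses the average over rollouts as a Monte Carlo estimate of the value function:

```python
# Minimal sketch: discounted return G_t from a list of per-step rewards.
# `rewards` and `gamma` are illustrative names, not from the post.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):   # work backwards: G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

# A Monte Carlo estimate of V(s) is the average return over rollouts from s.
rollouts = [[1.0, 0.0, 2.0], [0.5, 1.5, 0.0]]
v_estimate = sum(discounted_return(rs) for rs in rollouts) / len(rollouts)
```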

The expected return under policy \pi_{\theta}, written in terms of our value function, is

J(\theta) = \sum_s d_{\pi}(s) V_{\pi}(s),

where d_{\pi}(s) is the stationary distribution of states under policy \pi.

Here, \theta represents the parameters that control the policy. In the context of language models, \theta are the weights of the language model, and the policy is the distribution over what the model outputs. With this lens, it makes sense that we perform gradient ascent on \theta to maximize the expected return:

\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta).
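As a toy sketch of what this update rule does (everything here is illustrative; in practice the gradient is estimated from sampled trajectories, as derived next):

```python
# Toy sketch of gradient ascent on a simple objective J(theta) = -(theta - 3)^2,
# whose gradient is -2 * (theta - 3). Purely illustrative.
def grad_J(theta):
    return -2.0 * (theta - 3.0)

theta, alpha = 0.0, 0.1
for _ in range(100):
    theta = theta + alpha * grad_J(theta)   # theta <- theta + alpha * grad J(theta)

print(theta)   # approaches the maximizer theta = 3
```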

Deriving the Policy Gradient

To perform gradient ascent, we need a way to estimate this gradient. We start from the exact gradient g, which is

g = \nabla_{\theta} \mathbb{E}[R(\tau)] = \sum_{\tau} \nabla_{\theta} p_{\theta}(\tau) R(\tau), \quad \text{where } R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t.

We'll show that an equivalent but far more tractable expression for g is

g = \nabla_{\theta} J(\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) R(\tau)\right].
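Before proving this, here's a minimal PyTorch sketch (toy softmax policy, illustrative names) of how the estimator is typically implemented: minimize the surrogate loss -\sum_t \log \pi_{\theta}(a_t \mid s_t) R(\tau), whose gradient is the negative of the expression above.

```python
import torch

# Minimal sketch of the policy gradient (REINFORCE) estimator for a
# toy softmax policy over 3 actions with a 4-dim state. Illustrative only.
torch.manual_seed(0)
policy = torch.nn.Linear(4, 3)                     # logits = policy(state)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

states  = torch.randn(5, 4)                        # one sampled trajectory, 5 steps
actions = torch.randint(0, 3, (5,))                # actions taken at each step
R_tau   = torch.tensor(2.0)                        # return of the whole trajectory

log_probs = torch.log_softmax(policy(states), dim=-1)   # log pi(a|s) for all a
log_pi_a  = log_probs[torch.arange(5), actions]         # log pi(a_t|s_t)

loss = -(log_pi_a.sum() * R_tau)   # minimizing this ascends the policy gradient
optimizer.zero_grad()
loss.backward()
optimizer.step()
```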

Proof:

Let the trajectory probability for \tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots) factorize as

p_{\theta}(\tau) = p(s_0) \prod_{t=0}^{\infty} \pi_{\theta}(a_t \mid s_t) P(s_{t+1} \mid s_t, a_t).

Only the policy term \pi_{\theta}(a_t \mid s_t) depends on \theta, so

\nabla_{\theta} \log p_{\theta}(\tau) = \sum_{t=0}^{\infty} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t).

Using the identity \nabla_{\theta} p = p \nabla_{\theta} \log p, we obtain

g = \sum_{\tau} p_{\theta}(\tau) \nabla_{\theta} \log p_{\theta}(\tau) R(\tau)
= \mathbb{E}\left[\sum_{t=0}^{\infty} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) R(\tau)\right],

which is exactly the expression we set out to show.
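The derivation hinges on the log-derivative identity \nabla_{\theta} p = p \nabla_{\theta} \log p. A quick numerical sanity check (illustrative, not from the post):

```python
import torch

# Sanity check of the identity  grad p = p * grad log p  for a softmax
# distribution over 4 actions. Purely illustrative.
logits = torch.randn(4, requires_grad=True)
probs = torch.softmax(logits, dim=0)

a = 2  # an arbitrary action index
grad_p, = torch.autograd.grad(probs[a], logits, retain_graph=True)
grad_log_p, = torch.autograd.grad(torch.log(probs[a]), logits)

assert torch.allclose(grad_p, probs[a].detach() * grad_log_p)
```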

The Variance Problem

Vanilla policy gradient works in theory. In practice, however, the high variance of this estimator makes for unstable gradient updates. This motivates the idea of only accepting policy updates within a certain acceptable range—a "trust region." This is the key idea behind Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO).

We'll focus on PPO: while TRPO enforces the trust region with an expensive second-order method (Fisher-vector products), PPO uses a simple first-order clipping operation.

Proximal Policy Optimization (PPO)

For PPO, let's first define the policy ratio:

r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}.
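In code, this ratio is usually computed from log-probabilities rather than by dividing raw probabilities, for numerical stability. A small sketch with illustrative tensors:

```python
import torch

# Sketch: the policy ratio r_t(theta) computed from log-probs, which is more
# numerically stable than dividing raw probabilities. Tensors are illustrative.
logp_new = torch.tensor([-1.20, -0.70, -2.30])   # log pi_theta(a_t | s_t)
logp_old = torch.tensor([-1.00, -0.90, -2.30])   # log pi_theta_old(a_t | s_t)

ratio = torch.exp(logp_new - logp_old)           # r_t(theta)
```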

The PPO objective is

J(\theta) = \mathbb{E}\left[\min\left(r_t(\theta) A_t, \ \text{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon) A_t\right)\right].
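Here's a minimal sketch of this objective as it might appear in code, written as a loss to minimize (names and values are illustrative, not a reference implementation):

```python
import torch

# Minimal sketch of the PPO clipped surrogate, written as a loss to minimize.
# `ratio` and `advantages` are illustrative per-token tensors; eps is the clip range.
def ppo_clip_loss(ratio, advantages, eps=0.2):
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()   # negate: we maximize J(theta)

ratio = torch.tensor([0.9, 1.1, 1.6])
advantages = torch.tensor([1.0, -0.5, 2.0])
loss = ppo_clip_loss(ratio, advantages)
```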

Let's break down what this means:

  • If r_t(\theta) > 1: the new policy gives the action higher probability than the old policy
  • If r_t(\theta) < 1: the new policy gives the action lower probability than the old policy
  • If A_t > 0: the action was better than expected (positive advantage)
  • If A_t < 0: the action was worse than expected (negative advantage)

The clipping ensures that once the policy ratio moves outside the trust region (1-\varepsilon, 1+\varepsilon) in the direction the advantage favors, the gradient of J(\theta) becomes 0. This is the 'proximal' part of PPO: we don't allow drastic updates that could be destructive.

PPO Clipping Diagram
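A toy check of this clipping behavior (illustrative): with a positive advantage and a ratio already above 1 + \varepsilon, the objective's gradient with respect to the ratio is zero, so there's no incentive to push the policy further away.

```python
import torch

# Toy check: with a positive advantage and a ratio already above 1 + eps,
# the clipped objective has zero gradient w.r.t. the ratio.
eps = 0.2
ratio = torch.tensor(1.5, requires_grad=True)   # outside (1 - eps, 1 + eps)
advantage = torch.tensor(1.0)                   # positive advantage

objective = torch.min(ratio * advantage,
                      torch.clamp(ratio, 1 - eps, 1 + eps) * advantage)
objective.backward()
print(ratio.grad)   # tensor(0.) -- no incentive to push the ratio further
```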

Connection to Vanilla Policy Gradient:

If we take the gradient of J(\theta) inside the clipping region, we see that it reduces to the vanilla policy gradient:

\nabla_{\theta} J_{\text{PPO}}(\theta) = \mathbb{E}[\nabla_{\theta} r_t(\theta) A_t]
= \mathbb{E}[r_t(\theta) \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) A_t]
= \mathbb{E}[\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) A_t].

In the last step, we can drop the policy ratio because in the on-policy case \theta = \theta_{\mathrm{old}}, making the ratio equal to 1.

Understanding Advantage

The advantage A_t represents how much better or worse an action was compared to the policy's default behavior. Mathematically, it's A_t = Q_{\pi}(s_t, a_t) - V_{\pi}(s_t), where:

  • Q_{\pi}(s,a) is the action-value function: the expected return from taking action a in state s and then following \pi
  • V_{\pi}(s) is the state-value function: the expected return from state s under \pi

In practice, we estimate Q using the actual return R, since \mathbb{E}[R] = Q. For language models, the return of a response is scored by a reward model—a separate network that assigns scores to completed responses.
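Putting those pieces together, here's a small sketch of the advantage computation for a batch of responses, assuming one scalar reward-model score per response and a critic's value estimate (illustrative names; practical implementations typically use per-token values and GAE):

```python
import torch

# Sketch: advantage = (reward-model score, standing in for the return / Q)
#                   - (critic's value estimate, the baseline V).
# All tensors are illustrative.
rm_scores = torch.tensor([0.8, 0.1, 0.5])    # reward model scores per response
values    = torch.tensor([0.4, 0.4, 0.4])    # critic's V(s) estimates
advantages = rm_scores - values              # A = Q - V (positive => better than expected)
```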

In PPO, we compute V(s) with a critic network—extra architecture attached to the base model. For RLHF, this critic requires around 10% more FLOPs and memory. This value function assigns scores to incomplete responses (partial trajectories).

PPO Architecture Diagram
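The critic is often implemented as a small value head on top of the base model's hidden states. A minimal sketch (the hidden size and module names are assumptions, not the post's architecture):

```python
import torch
import torch.nn as nn

# Sketch of a value head: a linear layer mapping each token's hidden state
# to a scalar value estimate. `hidden_size` is an assumed dimension.
class ValueHead(nn.Module):
    def __init__(self, hidden_size=4096):
        super().__init__()
        self.v = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states):              # (batch, seq_len, hidden_size)
        return self.v(hidden_states).squeeze(-1)   # (batch, seq_len) values

value_head = ValueHead()
values = value_head(torch.randn(2, 16, 4096))      # value for each partial response
```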

Group-Relative Policy Optimization (GRPO)

GRPO takes a different approach by removing the critic network entirely. Instead, GRPO:

  1. Generates g responses to the same prompt
  2. Calculates the advantage of each response relative to the average return of the group
  3. Uses this group-relative advantage for policy updates

This trades the memory overhead of a critic network for the compute cost of generating extra responses, while gaining the simplicity of removing an entire model from the architecture.

GRPO Architecture Diagram

The key insight is that we can estimate advantages relative to the group performance rather than needing a separate value function approximator.
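A minimal sketch of that group-relative advantage computation (illustrative; subtracting the group mean is the core idea, and implementations commonly also normalize by the group's standard deviation):

```python
import torch

# Sketch: GRPO-style advantages for g responses to the same prompt.
# Rewards are illustrative; no critic network is involved.
rewards = torch.tensor([1.0, 0.0, 0.5, 2.0])             # one score per response
advantages = rewards - rewards.mean()                     # relative to the group average
# Common variant: also normalize by the group's standard deviation
advantages_norm = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```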

Conclusion

We've traced the path from basic policy gradients to modern techniques like PPO and GRPO. The core challenge remains the same: how to efficiently estimate gradients for policy improvement while maintaining stability. PPO addresses this through trust regions and clipping, while GRPO offers a critic-free alternative that trades compute for architectural simplicity.
