deepseek's high level software magic
Reinforcement Learning Definitions - Exact Gradient - Deriving the Policy Gradient Objective - Proximal Policy Optimization (PPO) - In Context of Language Models - Group-Relative Policy Optimization (GRPO)
This post assumes that readers are familiar with basic definitions in reinforcement learning, including states, actions, rewards, discount factor, policy, and value function. If not, the first chapter of Sutton and Barto does a great job of covering these.
The Goal: Maximizing Return
The goal of any RL task is to maximize the total reward over all future time steps, which is called the return.
Mathematically, we can express this as

$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$

where $\gamma \in [0, 1)$ is the discount factor.
Calculating this directly is intractable (too many time steps, too many possible branches). This motivates us to learn a value function, which estimates the future return given a state.
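To make the return concrete, here is a minimal Python sketch that computes the discounted return for a toy finite episode (the reward sequence and discount factor are made-up illustration values):

```python
# Minimal sketch: the discounted return G_0 = sum_k gamma^k * r_k
# for a toy finite episode. Rewards and gamma below are illustrative.

def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):  # accumulate from the final step backwards
        g = r + gamma * g
    return g

print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.9^2 * 1.0 = 0.81
```

A reward arriving two steps in the future counts for only $\gamma^2$ as much as an immediate one, and this discounted sum is exactly what a value function has to estimate in expectation.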
The return under policy $\pi_\theta$, given our value function, is

$$J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta}}\left[ V^{\pi_\theta}(s) \right]$$

where $d^{\pi_\theta}$ is the stationary distribution of states under policy $\pi_\theta$.
Here, $\theta$ represents the parameters that control the policy. In the context of language models, $\theta$ are the weights of the language model, while the policy $\pi_\theta$ describes what the language model outputs. With this lens, it makes sense why we want to perform gradient ascent on $J(\theta)$ to maximize expected return.
Deriving the Policy Gradient
To perform gradient ascent, we need a way to estimate the gradient. We start from the exact gradient $\nabla_\theta J(\theta)$, equivalent to

$$\nabla_\theta J(\theta) = \nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right] = \nabla_\theta \int P(\tau \mid \theta)\, R(\tau)\, d\tau$$

We'll show that an equivalent but far more tractable expression for $\nabla_\theta J(\theta)$ is

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \right]$$
Proof:
Let the trajectory probability for $\tau = (s_0, a_0, s_1, a_1, \ldots, s_T, a_T)$ factorize as

$$P(\tau \mid \theta) = \rho_0(s_0) \prod_{t=0}^{T} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)$$

Only the policy terms depend on $\theta$, so

$$\nabla_\theta \log P(\tau \mid \theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

Using the identity $\nabla_\theta P(\tau \mid \theta) = P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta)$ (the log-derivative trick), we obtain

$$\nabla_\theta J(\theta) = \int \nabla_\theta P(\tau \mid \theta)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \nabla_\theta \log P(\tau \mid \theta)\, R(\tau) \right] = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \right]$$

which is exactly the tractable expression stated above.
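To see what this estimator looks like in code, here is a minimal PyTorch sketch of the score-function form for a toy categorical policy; the logits, actions, and return below stand in for data collected by rolling out the current policy, and all names and values are illustrative.

```python
import torch

# Minimal sketch of the policy-gradient estimator:
#   grad J ≈ sum_t grad log pi(a_t | s_t) * R(tau)
# `logits`, `actions`, and `trajectory_return` are stand-ins for data
# gathered by rolling out the current policy.

torch.manual_seed(0)
num_steps, num_actions = 5, 3
logits = torch.randn(num_steps, num_actions, requires_grad=True)  # policy outputs per step
actions = torch.randint(num_actions, (num_steps,))                # sampled actions
trajectory_return = torch.tensor(1.7)                             # R(tau), e.g. a reward-model score

log_probs = torch.log_softmax(logits, dim=-1)                     # log pi(a | s)
chosen_log_probs = log_probs[torch.arange(num_steps), actions]

# Maximizing J is the same as minimizing -J, so we negate the surrogate.
loss = -(chosen_log_probs.sum() * trajectory_return)
loss.backward()   # logits.grad now holds the REINFORCE gradient estimate
```

Calling `backward()` on the negated surrogate produces exactly the $\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)$ terms, summed over the trajectory.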
The Variance Problem
Vanilla policy gradient works in theory. In practice, however, the high variance makes for unstable gradient updates. This motivates the idea of only accepting gradient updates within a certain acceptable range—a "trust region." This is the key idea behind Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO).
We'll focus on PPO because while TRPO uses an expensive second-order method (Fisher-vector-product) to enforce the trust region, PPO uses a simple first-order clipping operation.
Proximal Policy Optimization (PPO)
For PPO, let's first define the policy ratio:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

The PPO objective is

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[ \min\!\left( r_t(\theta)\, \hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right) \hat{A}_t \right) \right]$$
Let's break down what this means:
- If $r_t(\theta) > 1$: the new policy gives the action higher probability than the old policy
- If $r_t(\theta) < 1$: the new policy gives the action lower probability than the old policy
- If $\hat{A}_t > 0$: the action was better than expected (positive advantage)
- If $\hat{A}_t < 0$: the action was worse than expected (negative advantage)
The clipping ensures that if the policy ratio falls outside our trust region of $(1-\varepsilon, 1+\varepsilon)$, the gradient of $L^{\text{CLIP}}$ becomes 0 for that sample. This is the 'proximal' part of PPO: we don't allow drastic updates that could be destructive.
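Here is a hedged sketch of the clipped surrogate in PyTorch (a minimal illustration, not any particular library's implementation; the tensor names and the default $\varepsilon = 0.2$ are illustrative):

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped PPO surrogate, negated so it can be minimized with an optimizer.

    All arguments are 1-D tensors over sampled (state, action) pairs;
    `old_log_probs` comes from the policy that generated the data and
    should not carry gradients.
    """
    ratio = torch.exp(new_log_probs - old_log_probs)             # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                 # maximize L^CLIP
```

Taking the element-wise min with the clipped term is what removes the incentive to push the ratio outside $(1-\varepsilon, 1+\varepsilon)$: once the clipped branch is the smaller one, its gradient with respect to $\theta$ is zero for that sample.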
Connection to Vanilla Policy Gradient:
If we take the gradient of $L^{\text{CLIP}}$ (ignoring the clipping), we see that it reduces to the vanilla policy gradient:

$$\nabla_\theta L(\theta) = \mathbb{E}_t\left[ \nabla_\theta\, \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, \hat{A}_t \right] = \mathbb{E}_t\left[ \frac{\nabla_\theta \pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, \hat{A}_t \right] = \mathbb{E}_t\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \right]$$

In the last step, we can cancel the policy ratio because in the on-policy case $\pi_{\theta_{\text{old}}} = \pi_\theta$ at the point where the gradient is evaluated, making the ratio always 1.
Understanding Advantage
The advantage represents how much better or worse an action was compared to the policy's default behavior. Mathematically, it's $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$, where:
- $Q(s_t, a_t)$ is the action-value function: the expected return from taking action $a_t$ in state $s_t$
- $V(s_t)$ is the state-value function: the expected return from state $s_t$
In practice, we estimate $Q(s_t, a_t)$ using the actual return $G_t$, since $Q(s_t, a_t) = \mathbb{E}\left[ G_t \mid s_t, a_t \right]$. For language models, the return of a response is scored by a reward model, a separate network that assigns scores to completed responses.
In PPO, we compute $V(s_t)$ with a critic network: extra architecture attached to the base model. For RLHF, this critic requires around 10% more FLOPs and memory. This value function assigns scores to incomplete responses (partial trajectories).
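As a minimal sketch of this Monte Carlo estimate (assuming we already have per-step rewards and the critic's value predictions for one trajectory; real PPO implementations often use fancier estimators such as GAE):

```python
import torch

def monte_carlo_advantages(rewards, values, gamma=0.99):
    """A_t ≈ G_t - V(s_t), with G_t computed from the sampled rewards.

    `rewards` and `values` are 1-D tensors for one trajectory;
    `values` holds the critic's predictions V(s_t).
    """
    returns = torch.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):      # discounted suffix sums
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns - values                      # advantage estimate
```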
Group-Relative Policy Optimization (GRPO)
GRPO takes a different approach by removing the critic network entirely. Instead, GRPO:
- Generates a group of $G$ responses to the same prompt
- Calculates the advantage of each response relative to the mean (and standard deviation) of the rewards within that group
- Uses this group-relative advantage for policy updates
This trades the memory overhead of a critic network for the compute cost of generating extra responses, while gaining the simplicity of removing an entire model from the architecture.
The key insight is that we can estimate advantages relative to the group performance rather than needing a separate value function approximator.
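Here is a minimal sketch of that group-relative advantage, following the standardization described in the GRPO paper (the function name, the example scores, and the stabilizing `eps` are illustrative):

```python
import torch

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each response relative to its group.

    `rewards` has shape (G,): one scalar reward-model score per response
    sampled for the same prompt. Each response's advantage is its reward
    standardized against the group mean and standard deviation.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 responses to one prompt, scored by a reward model (made-up values).
scores = torch.tensor([0.2, 0.9, 0.4, 0.5])
print(group_relative_advantages(scores))
```

Responses scoring above the group mean get positive advantages and are reinforced; those below the mean get negative advantages, all without querying a learned value function.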
Conclusion
We've traced the path from basic policy gradients to modern techniques like PPO and GRPO. The core challenge remains the same: how to efficiently estimate gradients for policy improvement while maintaining stability. PPO addresses this through trust regions and clipping, while GRPO offers a critic-free alternative that trades compute for architectural simplicity.
Sources:
- Schulman, J., Levine, S., Moritz, P., Jordan, M. I., & Abbeel, P. (2015). Trust Region Policy Optimization. arXiv preprint arXiv:1502.05477. https://arxiv.org/abs/1502.05477
- Lambert, N. (2024). Reinforcement Learning from Human Feedback. Online. https://rlhfbook.com
- Title inspired by meekolabs' DeepSeek's Low-Level Hardware Magic. Go check them out!