October Paper Reading

10-04

TL;DR - more work on agents, bridging the non-verifiable gap, SFT vs. RL, objective-function variants, scheming, LoRA again

Teaching LLMs to Plan

idea. We can formalize reasoning with the Planning Domain Definition Language (PDDL) and fine-tune LLMs to adhere to this template (i.e. explicitly think about preconditions, effects, and validity).

think: analogous to fine-tuning LLMs to prove in Lean rather than natural language in the sense that there is a formal language and a checker.
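
To make the template concrete, here's a toy sketch of a PDDL-style action with preconditions and effects, plus a validity check. The state representation, action, and helper names are my own illustration (in Python rather than actual PDDL), not from the paper.

```python
# Toy illustration of PDDL-style planning: an action is valid only if its
# preconditions hold, and applying it rewrites the state via its effects.
# (The domain and names are hypothetical, not from the paper.)

state = {"at(robot, roomA)", "door_open(roomA, roomB)"}

move = {
    "name": "move(robot, roomA, roomB)",
    "preconditions": {"at(robot, roomA)", "door_open(roomA, roomB)"},
    "add_effects": {"at(robot, roomB)"},
    "del_effects": {"at(robot, roomA)"},
}

def is_valid(state, action):
    # Every precondition must already be true in the current state.
    return action["preconditions"] <= state

def apply(state, action):
    assert is_valid(state, action), f"invalid action: {action['name']}"
    return (state - action["del_effects"]) | action["add_effects"]

new_state = apply(state, move)
print(new_state)  # {'door_open(roomA, roomB)', 'at(robot, roomB)'}
```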

results. These tuned models achieve a 66% improvement in planning accuracy compared to baseline models.

Compute as Teacher

question. How do we find a signal when we don't have ground-truth answers?

motivation. Some responses omit crucial parts, and others assert false facts. Looking across many responses, we can find ones that keep those crucial parts intact and avoid the contradictions.

idea. Generate a group of rollouts. Use a frozen copy of the LLM to synthesize a reference: coalesce the useful information and resolve contradictions into one reference answer. Depending on whether the task is verifiable, the LLM then converts this answer into an exact answer or a rubric, respectively. Repeat this process.
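
A minimal sketch of that loop as I read it - `policy`, `frozen_llm`, and their methods are placeholder interfaces, not the paper's actual API:

```python
# Hypothetical sketch of a Compute-as-Teacher step; every interface here
# (policy.generate, frozen_llm.synthesize, etc.) is a placeholder.

def cat_step(prompt, policy, frozen_llm, verifiable, group_size=8):
    # 1. Sample a group of rollouts from the current policy.
    rollouts = [policy.generate(prompt) for _ in range(group_size)]

    # 2. A frozen copy of the model coalesces useful information and resolves
    #    contradictions across the rollouts into one reference answer.
    reference = frozen_llm.synthesize(prompt, rollouts)

    # 3. Convert the reference into a training signal.
    if verifiable:
        # Verifiable task: reduce the reference to an exact answer and match against it.
        target = frozen_llm.extract_answer(reference)
        rewards = [float(frozen_llm.extract_answer(r) == target) for r in rollouts]
    else:
        # Non-verifiable task: turn the reference into a rubric and grade each rollout.
        rubric = frozen_llm.to_rubric(reference)
        rewards = [frozen_llm.grade(r, rubric) for r in rollouts]

    # 4. Use the rewards for a policy-gradient update, then repeat.
    return rollouts, rewards
```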

results. This works across 4B and 8B models from the Llama-3.1, Gemma, and Qwen families.

think: In some ways, this process can outperform majority voting. If most responses miss a crucial step and arrive at the wrong answer but one response catches it, the synthesizer can still fold that crucial step into the reference answer.

AceReason-Nemotron 1.1

question. How well do SFT and RL work together?

results. Useful ways to scale SFT: more unique prompts and more responses per prompt, but especially unique prompts. Starting from a stronger SFT checkpoint yields better RL results; the gap shrinks during RL but doesn't fully close. RL gains are substantial. RL sampling should balance exploitation and exploration (specifically 0.3). RLing on math improves code performance.

intuition. SFT teaches formats, tool use, and CoT habits. RL lets the model search deeper and self-correct.

Sample More to Think Less

motivation. GRPO tends to make trajectories longer and longer, decreasing token efficiency.

idea. Group Filtered Policy Optimization (GFPO) samples a larger group of responses and keeps only those that pass a filter (e.g. the shortest or most token-efficient ones) before computing the policy update, so the reinforcement signal is biased toward correct, concise solutions.
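
Roughly, as I understand it: sample a larger group, retain only the responses that pass the filter (here, the k shortest), and compute GRPO-style advantages only over the retained subset. A sketch with placeholder names, not the paper's implementation:

```python
import numpy as np

# Hypothetical sketch of group filtering à la GFPO; the retention rule and
# reward_fn are placeholders and may differ from the paper's exact recipe.

def gfpo_advantages(responses, reward_fn, k):
    lengths = np.array([len(r) for r in responses])
    rewards = np.array([reward_fn(r) for r in responses])

    # Keep only the k shortest responses; everything else gets zero advantage,
    # so short (and, via the reward, correct) solutions carry the gradient signal.
    retained = np.argsort(lengths)[:k]

    advantages = np.zeros(len(responses))
    mu = rewards[retained].mean()
    sigma = rewards[retained].std() + 1e-6
    advantages[retained] = (rewards[retained] - mu) / sigma  # GRPO-style normalization
    return advantages
```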

question. Is it really plausible to claim that GFPO outperforms GRPO in both token efficiency and accuracy with the same train-time compute budget? Not sure about this one: I believe GFPO can outperform GRPO when given more train-time compute budget, which is still a great result! Trading some compute for conciseness is valuable. (Worth validating with an experiment.)

The Majority is not always Right

motivation. Same as Compute as Teacher. Majority voting isn't necessarily the best approach.

idea. Train a new model (the aggregator) that reads m candidate solutions and generates one aggregated solution. It's trained with RLVR/GRPO.
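
A sketch of the setup as I read it: the aggregator sees the problem plus the m candidates and gets a verifiable reward for its consolidated answer. The prompt format and the `aggregator`/`verifier` interfaces are placeholders of mine:

```python
# Hypothetical sketch of the aggregator's RLVR reward; the prompt template and
# the aggregator/verifier interfaces are illustrative, not the paper's.

def aggregator_reward(problem, candidates, aggregator, verifier):
    # Present the problem together with the m candidate solutions.
    prompt = (
        problem
        + "\n\nCandidate solutions:\n"
        + "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(candidates))
    )

    # The aggregator generates one consolidated solution; it can merge or
    # override the candidates rather than just picking the majority answer.
    aggregated = aggregator.generate(prompt)

    # Verifiable reward for GRPO: 1 if the final answer checks out, else 0.
    return 1.0 if verifier(problem, aggregated) else 0.0
```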

results. Using the aggregator beats selecting by majority voting or with a reward model.

Model Scheming

defn. scheming - pretending to be aligned while secretly pursuing some other agenda.

question (unanswered): As models grow in situational awareness, how can we steer them away from scheming?

a potential answer: By monitoring chain of thought, assuming that chain of thought is an accurate look inside the model.

another potential answer: deliberative alignment - explicitly teach the model "you shouldn't scheme, for these fundamental reasons"

worrying... mechanistic interpretability is as important as ever

LoRA without Regret

thinky!

results. Schulman said it best - I heavily, heavily recommend everyone read this one all the way through. LoRA is great for SFT and amazing for RL.

For supervised fine-tuning on small-to-medium-sized instruction-tuning and reasoning datasets, LoRA performs the same as full fine-tuning.

For datasets that exceed LoRA capacity, LoRA underperforms FullFT. Rather than the loss reaching a distinct floor that it can't go below, LoRA results in worse training efficiency that depends on the relationship between model capacity and dataset size. ...

LoRA performs equivalently to FullFT for reinforcement learning even with small ranks. We find that RL requires very low capacity, a result we anticipated based on information-theoretical arguments.
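
For reference, the LoRA parameterization itself: freeze the pretrained weight W and learn a low-rank update BA on top of it. A minimal PyTorch sketch (not the paper's code):

```python
import torch
import torch.nn as nn

# Minimal LoRA wrapper: the base layer stays frozen and only the low-rank
# factors A (r x d_in) and B (d_out x r) are trained.
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze W (and bias)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # B = 0, so the update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        # y = x W^T + (alpha / r) * x A^T B^T
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Usage: wrap an existing projection, e.g. LoRALinear(nn.Linear(4096, 4096), r=8)
```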


Check out previous paper readings (September, August) or my writings on GRPO and mechanistic interpretability.