Late August Paper Reading
08-29 TL;DR: improving performance on non-verifiable tasks, rethinking hallucination detection, and a generalizable test-time compute method
Writing-Zero: Bridge the Gap Between Non-verifiable Tasks and Verifiable Rewards
I like this one! The team introduces two new techniques, GenRM (Generative Reward Model) and BRPO (Bootstrapped Relative Policy Optimization), to improve the writing capabilities of LLMs. The goal is to bring non-verifiable tasks like creative writing into the RLVR paradigm.
GenRM takes a prompt, two responses, and a set of principles: specific instructions like 'emphasize voice consistency' as well as general writing guidelines. It produces a critique, a comparative review of the two responses, and assigns each a floating-point score from 0 to 10. Training uses an accuracy reward (does the better response get the higher score?), a format reward to enforce the 0 to 10 range, and a margin weight so that small score gaps earn less credit. The margin weight improves the RM's sensitivity to quality when responses are similar.
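Here is a minimal sketch of how these three reward terms might combine when training the GenRM on a labeled preference pair. The function name, the outright -1 penalty for malformed scores, and the linear margin formula are my own illustration of the idea, not the paper's exact implementation.

```python
def genrm_training_reward(score_a: float, score_b: float, preferred: str) -> float:
    """Toy training reward for the GenRM on one labeled preference pair.

    score_a, score_b: the 0-10 scores the GenRM assigned to responses A and B.
    preferred: 'A' or 'B', the labeled better response.
    Combines an accuracy reward, a format reward, and a margin weight
    (exact formulas here are assumptions for illustration).
    """
    # Format reward: scores must land inside the allowed 0-10 range.
    if not all(0.0 <= s <= 10.0 for s in (score_a, score_b)):
        return -1.0  # malformed output is penalized outright

    # Accuracy reward: did the preferred response get the higher score?
    gap = score_a - score_b if preferred == "A" else score_b - score_a
    accuracy = 1.0 if gap > 0 else -1.0

    # Margin weight: a tiny score gap earns less credit than a confident one,
    # which pushes the RM to stay discriminative on near-tied responses.
    margin_weight = min(abs(gap) / 10.0, 1.0)

    return accuracy * margin_weight
```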
BRPO is an adaptation of GRPO to pairwise rewards. For each prompt, it randomly selects one rollout as a reference and compares every other rollout against it with the GenRM. The resulting binary reward is plugged into the GRPO objective: tokens from winning responses are reinforced and tokens from losing responses are suppressed. This works as an advantage because the reference is selected uniformly at random, so the expected reward is zero. The team also borrows dynamic sampling from DAPO: if the chosen reference is an outlier and the rewards are predominantly 1 or predominantly -1, the batch is dropped.
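A small sketch of the BRPO reward step as described above. Here `genrm_prefers` is a hypothetical wrapper around the GenRM, and the drop threshold for dynamic sampling is my own choice rather than the paper's.

```python
import random

def brpo_rewards(rollouts: list[str], genrm_prefers, drop_threshold: float = 0.9):
    """Pairwise rewards for one prompt's group of rollouts.

    genrm_prefers(a, b) -> True if the GenRM judges a better than b
    (hypothetical interface). Returns per-rollout rewards in {-1, 0, +1},
    or None if the batch should be dropped (dynamic sampling, as in DAPO).
    """
    # Pick a random rollout as the reference; its own reward is 0.
    ref_idx = random.randrange(len(rollouts))
    reference = rollouts[ref_idx]

    rewards = []
    for i, rollout in enumerate(rollouts):
        if i == ref_idx:
            rewards.append(0.0)
        else:
            rewards.append(1.0 if genrm_prefers(rollout, reference) else -1.0)

    # Dynamic sampling: if the reference was an outlier and almost every
    # comparison went the same way, the group carries little signal -> drop it.
    wins = sum(r > 0 for r in rewards)
    losses = sum(r < 0 for r in rewards)
    compared = wins + losses
    if compared and max(wins, losses) / compared >= drop_threshold:
        return None

    # Because the reference is chosen uniformly at random, the reward is
    # roughly zero in expectation, so it can serve as the GRPO advantage.
    return rewards
```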
The team notes that for writing tasks, reward hacking most commonly shows up as length bias (longer responses are scored higher) and redundant explanations (more justification gets scored higher). They note that Writing-Zero is much more concise than a previous model they trained, ScalarRM-GRPO, with 7x shorter explanations and 30% shorter responses. While I'd really prefer to look at the data directly (another issue: the writing samples are in Chinese), using response length as an indicator of reward hacking is justified.
As a reward model, GenRM beats Claude 3.5 Sonnet on RewardBench, and Writing-Zero improves significantly over the base model on WritingBench.
Reinforcement Learning with Rubric Anchors
"We construct, to our knowledge, the largest rubric reward system to date, comprising over 10,000 rubrics generated by humans, by various LLMs, or via a hybrid human–LLM collaboration."
I fear that collecting ten thousand rubrics, most of which require some human involvement, doesn't adhere to the Bitter Lesson. It would be more promising if these 10,000 rubrics were generated, say, by LLMs from a small seed set of 100 human-written rubrics.
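To make that suggestion concrete, here is a rough sketch of bootstrapping a large rubric set from a small human-written seed. The `llm(prompt) -> str` completion call, the few-shot prompt, and the omission of deduplication and quality filtering are all assumptions of mine, not anything from the paper.

```python
import random

def expand_rubrics(seed_rubrics: list[str], llm, target_count: int = 10_000) -> list[str]:
    """Bootstrap a large rubric set from a small human-written seed set.

    `llm(prompt) -> str` is a hypothetical text-completion call; dedup and
    quality filtering are left out for brevity.
    """
    rubrics = list(seed_rubrics)
    while len(rubrics) < target_count:
        # Show the model a few existing rubrics and ask for a new variation.
        examples = "\n".join(
            f"- {r}" for r in random.sample(rubrics, k=min(3, len(rubrics)))
        )
        prompt = (
            "Here are grading rubrics for evaluating LLM responses:\n"
            f"{examples}\n"
            "Write one new rubric in the same style, targeting a different "
            "aspect of response quality."
        )
        rubrics.append(llm(prompt).strip())
    return rubrics
```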
"By examining instances where the reward signal is anomalously high, we systematically identify and categorize recurrent, high-level patterns of reward-hacking behavior. This empirical analysis informs the development of a dedicated Reward Hacking Defense Rubric (shown in Section A.1). This new rubric is not part of the initial training but is synthesized from the observed failure modes and integrated as a supervisory constraint in all subsequent, more complex RL stages."
Observing failure modes and hand-crafting a new rubric specifically to stop reward hacking doesn't feel scalable. Ideally, a reward-hacking rubric would continually evolve to counter the model's growing attempts at reward hacking, without humans needing to identify the behavior along the way.
Enhancing Model Safety through Pretraining Data Filtering
"We identified harmful content using a classifier and pretrained models from scratch on the filtered dataset. This approach reeduced the model's accuracy on a harmful-capabilities evaluation by 33% relative compared to random baseline performance, while preserving its beneficial capabilities."
I like the idea here: if a model never learns CBRN (Chemical, Biological, Radiological, Nuclear) knowledge, it will be harder to elicit. The result that the model maintains performance on common benchmarks is really nice too.
That said, I see two big issues. First, the model could re-derive the missing CBRN knowledge from its general understanding of science. Second, if the model has web search, it can simply retrieve all of this content anyway.
I think the method here is solid. I would just add on refusal training with adversarial prompts and other anti-jailbreaking post-training techniques to make a 'safe' model.
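For reference, the filtering step itself is conceptually simple. This is a minimal sketch assuming a `harm_classifier(text) -> probability` interface and a 0.5 threshold, both of which are my own placeholders rather than the paper's setup.

```python
def filter_pretraining_corpus(documents, harm_classifier, threshold: float = 0.5):
    """Drop documents a harm classifier flags before pretraining.

    harm_classifier(text) -> probability the text contains harmful
    (e.g. CBRN-related) content; the interface and threshold are assumptions.
    """
    kept, dropped = [], 0
    for doc in documents:
        if harm_classifier(doc) >= threshold:
            dropped += 1
        else:
            kept.append(doc)
    print(f"Removed {dropped} of {len(kept) + dropped} documents.")
    return kept
```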
The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs
This paper basically argues that using ROUGE for hallucination detection is not effective, and a lot of the supposed progress is just Goodharting on ROUGE. The team shows that compared to LLM-as-a-Judge, ROUGE has a 45.9% drop in ranking quality.
ROUGE is a lexical-overlap metric, while what we actually care about is factuality. It penalizes correct but longer answers, ignores paraphrases and synonyms, and rewards wrong answers with similar wording ("Season 14 of Grey's Anatomy has 23 episodes" vs. "Season 14 of Grey's Anatomy has 24 episodes").
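To make the failure mode concrete, here is a toy unigram-overlap ROUGE-1 F1 (my own minimal implementation, not the official scorer) applied to the episode-count example: the factually wrong answer scores near-perfectly because only one token differs.

```python
def rouge1_f1(reference: str, candidate: str) -> float:
    """Minimal unigram-overlap ROUGE-1 F1 (no stemming or synonym handling)."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    # Count overlapping unigrams, clipped by how often each appears in the reference.
    remaining = list(ref_tokens)
    overlap = 0
    for tok in cand_tokens:
        if tok in remaining:
            overlap += 1
            remaining.remove(tok)
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall) if overlap else 0.0

reference = "Season 14 of Grey's Anatomy has 23 episodes"
wrong_answer = "Season 14 of Grey's Anatomy has 24 episodes"
print(round(rouge1_f1(reference, wrong_answer), 3))  # 0.875: near-perfect despite the factual error
```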
Deep Think with Confidence
Naive parallel thinking: give every rollout an equal vote and finish all K traces.
DeepConf parallel thinking: read token-level confidence (the negative average logprob of the top-k candidate tokens) as the rollout happens, compute a local 'group confidence' over a sliding window of recent tokens, and use that confidence to drop weak traces early and to weight the final vote (a minimal sketch follows below).
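The sketch below shows the token-confidence and sliding-window group-confidence computation for a single trace. The per-token top-k logprobs are assumed to come from whatever serving stack you use, and the window size and stopping threshold are illustrative values, not DeepConf's actual settings.

```python
from collections import deque

def token_confidence(topk_logprobs: list[float]) -> float:
    """Token-level confidence: negative mean logprob of the top-k candidate tokens."""
    return -sum(topk_logprobs) / len(topk_logprobs)

def should_stop_early(stream_of_topk_logprobs, window: int = 128, threshold: float = 1.5) -> bool:
    """Streaming check on one trace: maintain a sliding-window 'group confidence'
    and signal early termination when it drops below a threshold.

    `stream_of_topk_logprobs` yields, per generated token, the logprobs of the
    top-k candidates at that step (assumed interface). Window size and
    threshold are placeholders for illustration.
    """
    recent = deque(maxlen=window)
    for topk in stream_of_topk_logprobs:
        recent.append(token_confidence(topk))
        group_confidence = sum(recent) / len(recent)
        if len(recent) == window and group_confidence < threshold:
            return True   # weak trace: stop generating and exclude or down-weight it
    return False          # trace finished; its confidence can weight its vote
```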
This test-time compute method is really nice because it can be attached to any model whose weights (and therefore token logprobs) are accessible. Whereas other parallel methods require hundreds of extra traces, DeepConf prunes the low-confidence traces and cuts generated tokens by 20-95%.