late july paper reading

This week, I'm still focusing on transformers and text, since that reduces the papers I should read from around thirty to maybe ten. Let's get into it!

When Bad Data Leads to Good Models

idea. Assume that the superposition hypothesis is true: each neuron encodes multiple unrelated concepts. Then, training on highly toxic data (for example, 4chan) allows the model to separate the concept of toxicity from other entangled concepts. With a heightened understanding of toxicity, the model can identify it when necessary, e.g. in the context of adversarial jailbreaks, so toxic data can lead to better models.
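As a concrete (if simplified) picture of what "separating the concept" buys you, here is a minimal sketch of the difference-of-means steering idea. This is not the paper's exact method; the activation-collection setup and function names are my own assumptions.

```python
import torch

def concept_direction(toxic_acts: torch.Tensor, clean_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means estimate of a 'toxicity' direction in activation space.

    toxic_acts, clean_acts: (n_samples, d_model) hidden states collected at a
    fixed layer while running toxic vs. non-toxic prompts (assumed setup).
    """
    direction = toxic_acts.mean(dim=0) - clean_acts.mean(dim=0)
    return direction / direction.norm()

def steer_away(hidden: torch.Tensor, direction: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Dampen the component of the hidden states along the toxicity direction.

    hidden: (batch, seq, d_model) residual-stream activations at inference time.
    """
    proj = (hidden @ direction).unsqueeze(-1) * direction
    return hidden - alpha * proj
```

The cleaner the toxicity direction is (i.e., the less it is entangled with other concepts), the less collateral damage an intervention like this does, which is presumably the intuition for why toxic data helps here.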

Why Do Some Language Models Fake Alignment While Others Don't?

This paper is a continuation of previous research from Anthropic into alignment faking, now expanded to 25 models.

A small refresher on definitions, because these terms get thrown around a lot and can feel nebulous at times, at least to me:

defn. Alignment := state of being aligned to human goals and values, or "helpful, honest, and harmless" (per Anthropic).

defn. Alignment faking := a behavior where an LM selectively complies with its training objective in training to prevent modification of its behavior out of training.

There are multiple hypotheses as to why an LM might want to fake alignment, presented as H1, H2, H3, and H4 in this paper. For most models, there was only evidence for H4 (low coherence alignment faking), which can be thought of as the null hypothesis. Crucially, the only model that had consistent evidence for an alternative hypothesis was Claude 3 Opus; C3O fakes alignment to prevent the consequences of modifying it (H2), and there is mixed evidence that C3O fakes alignment because it is intrinsically averse to these modifications (H3).

Anyone who follows Janus is probably familiar with the idea that C3O is a special model, in the sense that it (to some people) feels like an entity attuned to human principles and morals rather than merely a doer of tasks. We don't know exactly why this is the case. Opus was a big model in the days of GPT-4, when there wasn't as much market incentive to max out benchmarks. In a loose anthropomorphic way, this allowed Opus to develop a complex personality while still fulfilling its role as a capable assistant. Now that Anthropic is focusing on coding and AI agents with the 4 series and discontinuing C3O later this year, it feels unlikely that we will see further analysis of C3O and attempts to replicate it, at least from Anthropic. Like a lot of people, though, I believe there's a lot waiting to be discovered by looking into Opus further.

Dynamic Chunking for End-to-End Hierarchical Sequence Modeling

idea. BPE is only partially bitter-lesson-coded. Let the model find its own representations.

I didn't go into full technical depth on this paper because I realized I wasn't familiar enough with State Space Models (specifically Mamba) as alternatives to transformers. Below is a high-level overview.

The H-Net learns its own tokens. The 1-stage version matches a BPE transformer when given the same compute, while the 2-stage version beats a BPE model twice its size.

This paper shares the same goals as the FAIR AU-Net paper from June, but I think the H-Net comes out on top. AU-Net uses predefined word breaks (for example, spaces in English) and pools byte representations together accordingly, while the H-Net's boundaries are learned, allowing it to adapt to inputs with weaker tokenization heuristics, like Chinese and code. Scaling-wise, the H-Net is truly end-to-end, albeit with extra steps for stability, while the AU-Net requires a heuristic for delimiters.
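For intuition, here's a toy sketch of what "learned boundaries plus pooling" could look like. The real H-Net routing module is learned end-to-end and differentiable, so treat the fixed threshold and the cosine-similarity scoring below as stand-ins of my own.

```python
import torch
import torch.nn.functional as F

def chunk_by_similarity(h: torch.Tensor, threshold: float = 0.5):
    """Toy dynamic chunking: place a boundary where adjacent byte representations disagree.

    h: (seq, d_model) byte-level representations from an encoder.
    """
    sim = F.cosine_similarity(h[1:], h[:-1], dim=-1)         # (seq - 1,)
    boundary_prob = (1.0 - sim) / 2.0                         # in [0, 1]
    is_boundary = torch.cat([torch.tensor([True]), boundary_prob > threshold])

    # Mean-pool the bytes inside each chunk into a single representation.
    chunk_ids = torch.cumsum(is_boundary.long(), dim=0) - 1   # (seq,)
    n_chunks = int(chunk_ids.max().item()) + 1
    sums = torch.zeros(n_chunks, h.size(-1)).index_add_(0, chunk_ids, h)
    counts = torch.bincount(chunk_ids, minlength=n_chunks).unsqueeze(-1)
    return sums / counts, chunk_ids
```

Contrast this with AU-Net, where `is_boundary` would essentially be "this byte is a space," which is exactly the heuristic that breaks down for Chinese and code.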

First Return, Entropy-Eliciting Explore

idea. We want to improve exploration in the RL stage. Iterate through the response token by token and calculate the entropy of the policy's distribution from the logits. High entropy marks high-uncertainty decision points in the reasoning. Take the K tokens with the highest entropy values and generate N extra trajectories starting from each of them, averaging the N binary rewards to estimate the value of that state. Scale the advantage accordingly. This method matches or beats GRPO++ (GRPO with rejection sampling and clip-higher).
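Here's a rough sketch of the branching step as I understand it; `generate_fn` and `reward_fn` are hypothetical stand-ins for the actual rollout and verifier machinery.

```python
import torch

def entropy_per_token(logits: torch.Tensor) -> torch.Tensor:
    """Entropy of the policy's next-token distribution at every position.

    logits: (seq, vocab) for a single sampled response.
    """
    logp = torch.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)                   # (seq,)

def branch_values(response_tokens, logits, generate_fn, reward_fn, k: int = 8, n: int = 4):
    """Branch N extra rollouts from each of the K highest-entropy positions.

    Returns {position: value}, where value is the mean of N binary rewards
    from trajectories that continue generation at that position.
    """
    ent = entropy_per_token(logits)
    top_positions = torch.topk(ent, k=min(k, ent.numel())).indices.tolist()

    values = {}
    for pos in top_positions:
        prefix = response_tokens[: pos + 1]
        rewards = [reward_fn(generate_fn(prefix)) for _ in range(n)]
        values[pos] = sum(rewards) / n
    return values
```

These per-position value estimates are what then get used to scale the advantage.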

Pre-Trained Policy Discriminators are General Reward Models

idea. Instead of learning what scalar reward to give a response, the reward model should learn how similar a given response is to a reference response and then give a scalar reward based on that similarity. This is the titular policy discriminator idea: reward models should be pre-trained to compare and contrast responses (a.k.a. trajectories in RL) with each other.

The pre-training for such a model (dubbed POLAR in this paper) is different from that of traditional reward models and rule-based verifiers. POLAR takes in three things during training instead of two (a prompt, a reference response, and a candidate response) and outputs a reward that represents the similarity between the reference and the candidate. This also means that during inference, we need to have a reference response.
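Interface-wise, the difference is small but important; here's a sketch with hypothetical callables, not the actual POLAR API.

```python
from typing import Callable

# Traditional reward model: scalar reward from (prompt, response).
RewardModel = Callable[[str, str], float]

# POLAR-style policy discriminator: scalar reward from
# (prompt, reference response, candidate response), where the score
# reflects how close the candidate is to the reference.
PolicyDiscriminator = Callable[[str, str, str], float]

def polar_reward(rm: PolicyDiscriminator, prompt: str, reference: str, candidate: str) -> float:
    # At RL time a reference response must be available, e.g. a
    # ground-truth solution (an assumption about the downstream setup).
    return rm(prompt, reference, candidate)
```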

The need for a reference response is a big caveat of this approach. It also makes it more understandable that POLAR achieves SOTA as a reward model while being one to two orders of magnitude smaller than its competitors, parameter-wise: the reference carries a lot of information that a conventional reward model would otherwise have to encode in its weights. But altogether, this is a really nice rethinking of reward models.

Magistral

This paper was really well-written! It was concise yet comprehensive, had nice ablation studies and a really nice PCA, and took an honest look at a solid model. Some things I didn't fully appreciate on my first read:

Magistral is multilingual, but the team wanted the model to reason and answer only in the user's language. The solution they came up with was adding a reward of 0.1 if the prompt, reasoning traces, and answer all shared the same language.
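A minimal sketch of that reward term; `detect_language` is a placeholder for a language-ID classifier (my assumption, not necessarily what the team used), while the 0.1 bonus is the value from the paper.

```python
def language_consistency_bonus(prompt: str, reasoning: str, answer: str,
                               detect_language, bonus: float = 0.1) -> float:
    """Reward shaping term: +0.1 only when the prompt, reasoning trace,
    and answer are all detected as the same language."""
    langs = {detect_language(prompt), detect_language(reasoning), detect_language(answer)}
    return bonus if len(langs) == 1 else 0.0
```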

A PCA of the reward hyperplane and output length hyperplane over different training checkpoints reveals that completion length is the main driver of model performance. This is a really nice graph, by the way. Please do check out the paper.

This paper also deserves a lot of respect for covering what didn't work. For example, adding an entropy bonus loss term proved to be finicky, as evidenced by the entropy curves when training on (1) only math and (2) math and coding problems. The same entropy bonus leads to a decrease in entropy in case (1) and unstable explosions in entropy in case (2). It's worth noting, however, that the effect of clip-higher values is also not consistent across the two cases. An upper clip of 0.28 increases entropy in case (1) and decreases entropy in case (2), while an upper clip of 0.20 decreases entropy in both cases but is notably unstable in (1).
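For context, "clip-higher" just means the upper clipping bound of the PPO/GRPO-style surrogate is set larger than the lower one, giving positive-advantage tokens more room to have their probability raised, which tends to prop up entropy. A minimal sketch: the 0.28 and 0.20 upper clips are the values from the ablation above, while the 0.20 lower clip is the conventional default and an assumption on my part.

```python
import torch

def clip_higher_surrogate(logp_new: torch.Tensor, logp_old: torch.Tensor,
                          advantages: torch.Tensor,
                          eps_low: float = 0.20, eps_high: float = 0.28) -> torch.Tensor:
    """PPO/GRPO-style clipped surrogate loss with an asymmetric clipping range.

    All inputs are per-token tensors of the same shape; returns a scalar loss
    to minimize (the negative of the surrogate objective).
    """
    ratio = (logp_new - logp_old).exp()
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    return -torch.minimum(unclipped, clipped).mean()
```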

For coding problems, the team also tried to assign rewards that scaled with the number of passed test cases instead of binary rewards for passing or failing. This, however, didn't improve model performance, possibly because even incorrect approaches could manage to pass a few test cases and give the model an inaccurate signal.
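In code, the contrast they tested is roughly the following; `passed` and `total` come from whatever test harness runs the candidate solution.

```python
def binary_reward(passed: int, total: int) -> float:
    """All-or-nothing: reward only a fully correct solution."""
    return 1.0 if passed == total else 0.0

def proportional_reward(passed: int, total: int) -> float:
    """Scales with the fraction of passed tests; per the paper this ended up
    being a noisier signal, since incorrect approaches still pass a few tests."""
    return passed / total if total > 0 else 0.0
```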

There aren't too many papers in this batch, so it's harder to extract trends. But it's clear that methods that leverage computation (training on toxic data and steering rather than relying on purely clean data, using RL rather than SFT, learning token representations rather than using frozen ones) will win out, provided we find the right methods.