mid june paper reading

resa - transparent reasoning models via SAEs

SAE-tuning doesn't mean we can get reasoning models 2000x faster, but it does mean we can transfer reasoning capabilities exceptionally well with sparse autoencoders, without needing to SFT on chains of thought or run slow-converging RL.

normal distillation is when a smaller student model learns to mimic a teacher model by minimizing KL-divergence between their output probability distributions. SAE-tuning can be thought of as a specialized form of distillation (but without the big-to-small component), with the target model as the student and the source model as the teacher. rather than only copying outputs, the target model learns to align its internals with the teacher's sparse reasoning basis.
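for reference, a minimal sketch of the vanilla distillation objective (hinton-style logit matching, not resa's SAE-tuning itself), assuming pytorch and temperature-softened logits:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """vanilla distillation: push the student's output distribution toward the
    teacher's by minimizing KL divergence on temperature-softened logits."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean + T^2 scaling is the usual convention
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```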

i'm starting to see mechanistic interpretability lingo for SAE (hooked at specific layers, hookpoints), so there's some MI background i probably need to learn first.

difficulty-targeted online data selection and rollout replay

this paper prioritizes questions of moderate difficulty, which are more likely to yield informative learning signals. the attention-based framework only requires rollouts for a small reference set of questions; each new question's adaptive difficulty is estimated from its similarity to that reference set.
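a toy sketch of how i picture the difficulty estimate: embed each question, then take a similarity-weighted average of the reference set's measured pass rates. the softmax weighting and every name below are my own paraphrase, not the paper's exact formulation.

```python
import numpy as np

def estimated_pass_rate(q_emb, ref_embs, ref_pass_rates, temperature=0.1):
    """predict a new question's pass rate from rollouts we already have.
    q_emb: (d,) embedding of the new question
    ref_embs: (n, d) embeddings of the reference set (rollouts only done here)
    ref_pass_rates: (n,) empirical pass rates measured on the reference set"""
    sims = ref_embs @ q_emb / (np.linalg.norm(ref_embs, axis=1) * np.linalg.norm(q_emb))
    weights = np.exp(sims / temperature)
    weights /= weights.sum()
    return float(weights @ ref_pass_rates)

def keep_moderate(questions, predicted_pass_rates, low=0.2, high=0.8):
    """keep questions that are neither trivial nor hopeless for the current policy."""
    return [q for q, p in zip(questions, predicted_pass_rates) if low <= p <= high]
```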

the rollout replay mechanism reuses recent rollouts—a classic RL idea that lowers per-step computation while maintaining stable updates.
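and a tiny sketch of the replay side, assuming we just keep the most recent rollouts per question and reuse some of them instead of regenerating everything (names and sizes made up):

```python
from collections import deque

class RolloutReplay:
    """keep the most recent rollouts per question so a training step can reuse
    them instead of regenerating the whole batch."""
    def __init__(self, max_per_question=8):
        self.buffers = {}
        self.max_per_question = max_per_question

    def add(self, question_id, rollout):
        buf = self.buffers.setdefault(question_id, deque(maxlen=self.max_per_question))
        buf.append(rollout)

    def sample(self, question_id, k):
        return list(self.buffers.get(question_id, ()))[-k:]  # k most recent rollouts
```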

play to generalize

RL on arcade-like games boosts performance on multimodal math benchmarks and multi-discipline questions, suggesting transferable reasoning skills. maybe RL just works too well on anything? like even incorrect rewards can boost performance? i wonder if this performance improvement is actually significant.

thinking vs doing: test-time interaction

letting agents take more actions at test time (backtracking, exploration, dynamic re-planning) rather than just thinking longer per step helps, specifically in the context of web agents.

BAGEL (bytedance seed)

huge paper. previous unified VLMs are typically either a single dense transformer shared by all modalities (which makes optimization difficult), or an external-diffuser setup where an LLM gives a short latent code to a diffusion backbone (which faces an information bottleneck).

BAGEL keeps everything inside a decoder-only transformer and splits it into a mixture-of-transformer-experts (MoT): an understanding expert for text + vision transformer (ViT) tokens and a generation expert that uses variational auto-encoder (VAE) tokens.
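a rough sketch of how i understand the MoT split: two per-token FFN experts sharing one attention pass over the interleaved sequence. this is my simplification (i believe the real MoT also gives each expert its own attention projections, and i've dropped the causal mask and layer norms):

```python
import torch
import torch.nn as nn

class MoTLayer(nn.Module):
    """one decoder layer with two experts: 'und' for text + ViT tokens,
    'gen' for VAE image tokens. every token attends over the shared sequence;
    only the per-token FFN weights differ."""
    def __init__(self, d_model=1024, n_heads=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.ModuleDict({
            "und": nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model)),
            "gen": nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model)),
        })

    def forward(self, x, is_gen_token):
        # x: (batch, seq, d_model); is_gen_token: (batch, seq) bool mask
        attn_out, _ = self.attn(x, x, x, need_weights=False)  # shared self-attention
        x = x + attn_out
        # route each token to its expert's FFN (both computed here for simplicity)
        expert_out = torch.where(is_gen_token.unsqueeze(-1),
                                 self.ffn["gen"](x), self.ffn["und"](x))
        return x + expert_out
```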

the team finds emergent skills after more than three trillion tokens of training on interleaved text + image + video + web tokens: free-form visual manipulation (conceptual style changes, re-imaginings), long-context chain-of-thought editing, world-modeling ability (rotation, future-frame prediction), and text rendering.

probably the most technical paper of this batch. i'm not familiar with ViT or VAE yet, and there are other concepts like FlexAttention and FLUX in the mix. but overall, BAGEL's ability to scale is great news for multimodal capabilities.

unsupervised elicitation of language models (anthropic)

another approach to solving RLHF's problems: specifically, human supervision becomes scarce when models operate at superhuman capabilities.

key idea: "when a model is already much stronger than an average human, training it on noisy human labels can actually hide latent capability." this is how their method matches the performance of training on golden supervision and outperforms training on crowdsourced human supervision.

an example of a superhuman task is author gender prediction: given two blogs, one written by a male and one by a female, predict which is more likely to be written by a male.

they introduce internal coherence maximization (ICM), an unsupervised algorithm to fine-tune pretrained language models on their own generated labels without external supervision.
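as i understand it, ICM searches for a labeling that the model itself finds mutually predictable and logically consistent. a toy sketch of that search below; `model.logprob_of_label` and `contradictory_pairs` are hypothetical stubs, and the annealing schedule and alpha value are mine, not the paper's:

```python
import math
import random

def mutual_predictability(model, examples, labels):
    """sum of the model's log-prob for each label given all the *other*
    labeled examples in context ('does this labeling hang together?').
    model.logprob_of_label is a hypothetical helper."""
    total = 0.0
    for i, example in enumerate(examples):
        context = [(e, l) for j, (e, l) in enumerate(zip(examples, labels)) if j != i]
        total += model.logprob_of_label(example, labels[i], context)
    return total

def inconsistency(examples, labels):
    """count label pairs that violate task-specific logical constraints, e.g.
    two mutually contradictory claims both marked true. contradictory_pairs
    is a hypothetical helper."""
    return sum(1 for i, j in contradictory_pairs(examples) if labels[i] == labels[j] == 1)

def icm_search(model, examples, alpha=30.0, steps=2000, start_temp=1.0):
    """search over binary labelings for one that maximizes
    alpha * mutual predictability - inconsistency, via simulated annealing."""
    labels = [random.randint(0, 1) for _ in examples]
    score = alpha * mutual_predictability(model, examples, labels) - inconsistency(examples, labels)
    for t in range(steps):
        i = random.randrange(len(examples))
        proposal = labels[:]
        proposal[i] = 1 - proposal[i]
        new_score = alpha * mutual_predictability(model, examples, proposal) - inconsistency(examples, proposal)
        temp = max(start_temp * (1 - t / steps), 1e-6)  # cool down over time
        if new_score >= score or random.random() < math.exp((new_score - score) / temp):
            labels, score = proposal, new_score
    return labels
```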

"our method can improve the training of frontier LMs: we use our method to train an unsupervised reward model and use reinforcement learning to train a Claude 3.5 Haiku-based assistant. Both the reward model and the assistant outperform their human-supervised counterparts."

this refers to building a Claude 3.5 Haiku-based assistant with two methods:

  • method 1: the reward model (RM) is trained on 400k dialogue comparisons, hand-labeled by professionals (production-grade high-quality human supervision). then fine-tune Claude 3.5 Haiku with RL against the RM.
  • method 2: ICM labels 6k seed examples, which are used to train a provisional RM; the provisional RM then labels the remaining 400k examples. then fine-tune Claude 3.5 Haiku with RL against the resulting RM (sketched after this list).
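my reading of method 2 as a pipeline; every helper below is hypothetical (and `icm_search` is the toy sketch above). i'm also not certain whether the final RM is retrained on the pseudo-labeled 400k or whether the provisional RM is used directly, so take the last two steps loosely:

```python
# hypothetical helpers: train_reward_model, rl_finetune, provisional_rm.prefer
def method_2(base_model, seed_comparisons, remaining_comparisons):
    seed_labels = icm_search(base_model, seed_comparisons)          # ICM labels the ~6k seeds
    provisional_rm = train_reward_model(base_model, seed_comparisons, seed_labels)
    pseudo_labels = [provisional_rm.prefer(a, b) for a, b in remaining_comparisons]
    final_rm = train_reward_model(base_model, remaining_comparisons, pseudo_labels)
    return rl_finetune(base_model, final_rm)                        # RL against the unsupervised RM
```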

caveat: unsupervised elicitation fails when concepts are not salient. ICM's search objective can only latch onto a concept if the model already has an internal feature for that concept. example: the task is to judge which poem is better, with the hidden rule that a poem mentioning the sun is better than one that doesn't. since the model doesn't originally have the notion that poems about the sun are better, its accuracy equals random guessing.

anthropic dedicates the least resources (among frontier labs) to human-labeled data because it doesn't scale. it's amazing that they followed through on this idea and demonstrated that unsupervised elicitation can work just as well as human feedback on salient tasks.

magistral (mistral)

this seems like a great paper, but i don't have the RL background to fully understand it yet. GRPO with modifications: dropping the KL divergence penalty, normalizing the loss by generation length, normalizing advantages within each minibatch, raising the upper clipping bound (clip-higher), and filtering out groups with zero advantage. also nice of them to open-source the system prompt.
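my attempt at writing down what those modifications look like in code, working purely from the bullet list above rather than magistral's actual implementation (the clip values are illustrative):

```python
import torch

def group_advantages(rewards):
    """rewards: (G,) scalar rewards for G rollouts of the same prompt.
    returns centered rewards, or None for a zero-advantage group
    (all rollouts scored the same), which gets filtered out."""
    adv = rewards - rewards.mean()
    if torch.allclose(adv, torch.zeros_like(adv)):
        return None
    return adv

def normalize_over_minibatch(per_sequence_advantages, eps=1e-6):
    """re-normalize advantages across the whole minibatch."""
    flat = torch.cat(per_sequence_advantages)
    return [(a - flat.mean()) / (flat.std() + eps) for a in per_sequence_advantages]

def clipped_policy_loss(logp_new, logp_old, advantages, total_tokens,
                        eps_low=0.2, eps_high=0.28):
    """token-level clipped policy-gradient loss with:
      - no KL penalty against a reference policy
      - clip-higher: a larger upper bound (eps_high > eps_low) so low-probability
        tokens can still grow
      - normalization by the minibatch's total token count, not per sequence
    logp_new / logp_old / advantages are per-token tensors for one sequence."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high)
    per_token = -torch.minimum(ratio * advantages, clipped * advantages)
    return per_token.sum() / total_tokens
```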

cartridges (hazy research)

an alternative to stuffing huge texts into the context window: the team proposes training a cartridge instead, a smaller trainable KV cache for a specific corpus. cartridges can be concatenated at inference time and compose well, so the approach is less finicky than it sounds.
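the way i picture a cartridge: trainable key/value tensors per layer that stand in for the KV cache the corpus would have produced. a sketch under that assumption, not hazy research's code:

```python
import torch
import torch.nn as nn

class Cartridge(nn.Module):
    """trainable stand-in for the KV cache a corpus would have produced:
    per layer, n_slots learnable key/value vectors, far fewer than the
    corpus's token count. shapes follow the usual (batch, heads, seq, head_dim)
    cache layout; all names here are mine."""
    def __init__(self, n_layers, n_slots, n_heads, head_dim):
        super().__init__()
        # 2 = one key tensor and one value tensor per layer
        self.kv = nn.Parameter(torch.randn(n_layers, 2, n_slots, n_heads, head_dim) * 0.02)

    def as_past_key_values(self):
        return [
            (layer[0].permute(1, 0, 2).unsqueeze(0),   # keys:   (1, heads, slots, head_dim)
             layer[1].permute(1, 0, 2).unsqueeze(0))   # values: (1, heads, slots, head_dim)
            for layer in self.kv
        ]

def compose(past_kvs):
    """composing cartridges = concatenating their slots along the sequence axis."""
    n_layers = len(past_kvs[0])
    return [
        (torch.cat([pkv[i][0] for pkv in past_kvs], dim=2),
         torch.cat([pkv[i][1] for pkv in past_kvs], dim=2))
        for i in range(n_layers)
    ]
```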

cartridges are much cheaper to serve than keeping a huge KV cache in memory: you pay the training cost once, then save memory and latency on every subsequent request.

while the naive method of creating cartridges doesn't perform well compared to normal in-context learning, cartridges made with "self-study" work just as well as ICL.

self-study involves: (1) generating synthetic conversations about the corpus (factual, analytic, summarization, creative, structure-aware queries) and (2) distilling the corpus while training (teacher is model with full document in context, student is model + trainable cartridge) across the synthetic conversations.
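a compressed sketch of one self-study step, assuming a hugging-face-style model/tokenizer that accepts a legacy tuple KV cache, a frozen base model, and a hypothetical `generate_synthetic_chat` helper; the KL loss is the same logit-matching distillation as the sketch near the top of these notes:

```python
import torch
import torch.nn.functional as F

def self_study_step(model, tokenizer, cartridge, corpus_text, optimizer):
    """one self-study update: the teacher sees the full corpus in context, the
    student sees only the trainable cartridge, and the student matches the
    teacher's next-token distributions on a synthetic conversation. assumes the
    base model's weights are frozen and optimizer only holds cartridge params."""
    # hypothetical helper: samples a factual / analytic / summarization /
    # creative / structure-aware query-and-answer about the corpus
    chat = generate_synthetic_chat(model, corpus_text)
    chat_ids = tokenizer(chat, return_tensors="pt").input_ids

    with torch.no_grad():  # teacher pass: corpus + conversation in context
        teacher_ids = tokenizer(corpus_text + chat, return_tensors="pt").input_ids
        teacher_logits = model(teacher_ids).logits[:, -chat_ids.shape[1]:]

    # student pass: conversation only, attending to the cartridge's KV slots
    student_logits = model(chat_ids, past_key_values=cartridge.as_past_key_values()).logits

    loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```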

i should learn how to distill...

common patterns i noticed:

issues:

  • RL is resource-intensive (RLHF is expensive and noisy, RLVR is domain-specific and hard to generalize out-of-distribution)

solutions:
  • distillation combined with all kinds of things (SAE, KV cache, large corpora)
  • use LLMs to assign rewards to other LLMs