early july paper reading
From Bytes to Ideas: Language Modeling with Autoregressive U-Nets
text, pre-training
idea. transformers rely on fixed tokenizers (BPE, SentencePiece) to chop inputs into tokens. instead, feed the model raw bytes and let it learn its own multi-scale embeddings as it trains. let the model decide how to carve up its input.
implementation. a decoder-only transformer, but it takes in raw bytes. the U-shape (the name comes from the U-Net used for image segmentation) contracts bytes into words, then words into two-word chunks, feeds those through eighteen attention layers, and expands them back to byte length. cross-entropy is on the next byte rather than the next token.
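the contract/process/expand shape is easier to see in code. a toy one-stage sketch (mine, not the paper's: pooling at spaces stands in for the learned splitting function, and the deeper multi-stage hierarchy plus causal masking are omitted):

```python
import torch
import torch.nn as nn

class ToyAUNet(nn.Module):
    # toy one-stage contract / process / expand over raw bytes.
    # causal masking is omitted for brevity, so this version leaks future bytes.
    def __init__(self, d=64, n_heads=4, n_layers=2):
        super().__init__()
        self.byte_emb = nn.Embedding(256, d)
        layer = nn.TransformerEncoderLayer(d, n_heads, 4 * d, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, n_layers)  # runs at word scale
        self.out = nn.Linear(d, 256)  # next-byte logits

    def forward(self, byte_ids):  # byte_ids: (1, T) long tensor of byte values
        x = self.byte_emb(byte_ids)                         # (1, T, d)
        # contract: keep the hidden state at each word boundary (space = byte 32)
        bounds = (byte_ids[0] == 32).nonzero().squeeze(-1)
        words = self.trunk(x[:, bounds, :])                 # (1, W, d)
        # expand: broadcast each word vector back over its byte span
        up = torch.zeros_like(x)
        prev = 0
        for i, b in enumerate(bounds.tolist()):
            up[:, prev:b + 1, :] = words[:, i, :]
            prev = b + 1
        return self.out(x + up)                             # (1, T, 256)

logits = ToyAUNet()(torch.tensor([list(b"bytes to ideas ")]))
```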
comments. when trained on the same DCLM data, the three-stage AU-Net matches or slightly beats a BPE transformer on language tasks and wins on character-centric tests, but loses on GSM8K and TriviaQA. this makes sense given the trade-offs of a learned embedding versus a fixed one: numbers and entities are spelled out byte-by-byte, so the AU-Net needs more compute than a BPE model to learn them.
LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs
text, pre-training
idea. context-length extension tricks can work on diffusion LLMs too. this paper extends context length 6x, training-free and with accuracy preserved, using an NTK-style rescaling (neural tangent kernel) of RoPE.
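the standard training-free NTK-aware trick rescales the rotary base; a sketch (whether LongLLaDA uses exactly this formula is my assumption):

```python
import torch

def ntk_rope_inv_freq(dim: int, scale: float = 6.0, base: float = 10000.0) -> torch.Tensor:
    # NTK-aware RoPE extension: raise the rotary base so low frequencies
    # stretch to cover a scale-times-longer context while high frequencies
    # barely move. no retraining involved.
    new_base = base * scale ** (dim / (dim - 2))
    return 1.0 / new_base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim)

inv_freq = ntk_rope_inv_freq(128, scale=6.0)  # plug into the rotary embedding
```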
comments. need to read up more on diffusion LLMs, neural tangent kernel, and context-length extension tricks (seen this in YaRN, Arcee's latest model, and here).
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
text, pre-training, post-training
implementation. MiniMax-M1 is an MoE with lightning attention, a linear-time attention variant; a standard softmax-attention block follows every seven lightning blocks. this allows huge context lengths and long CoTs while saving FLOPs, and the model stays competitive with other open-weight models. also note that the RL run cost about half a million dollars, and they introduce a new RL objective, CISPO, which clips the importance-sampling weights themselves (with a stop-gradient) rather than clipping token updates the way PPO's surrogate does.
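my reading of the CISPO objective, as a sketch (not their code; the GRPO-style scalar advantage and the epsilon values are assumptions):

```python
import torch

def cispo_loss(logp_new, logp_old, adv, eps_low=0.1, eps_high=0.1):
    # CISPO: clip the importance-sampling weight and stop-gradient it, so every
    # token keeps a REINFORCE-style gradient; PPO's clipped surrogate would
    # zero out gradients for clipped tokens instead.
    # logp_new, logp_old: (T,) per-token log-probs; adv: scalar trajectory advantage.
    ratio = torch.exp(logp_new - logp_old.detach())
    weight = torch.clamp(ratio, 1 - eps_low, 1 + eps_high).detach()  # sg(clip(r))
    return -(weight * adv * logp_new).mean()
```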
RLVR Implicitly Incentivizes Correct Reasoning in LLMs
text, post-training
idea. this paper refutes the claim that RLVR doesn't create new reasoning ability and only biases the model toward already-known solutions, which a previous paper argued using Pass@K. the counter-argument: Pass@K isn't a great proxy for reasoning quality - think of lucky answers reached through wrong processes.
implementation. this paper uses an LLM judge and a new metric, CoT-Pass@K, which grades the chain of thought as well as the final answer. RLVR does well on this metric and doesn't show the plateau that Pass@K does.
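the estimator itself is just the standard unbiased pass@k with a stricter notion of "correct". a sketch (the LLM-judge call that produces c is not shown):

```python
from math import comb

def cot_pass_at_k(n: int, c: int, k: int) -> float:
    # unbiased pass@k estimator (Chen et al., 2021), reused for CoT-Pass@K:
    # n = samples drawn, c = samples whose final answer AND chain of thought
    # were both judged correct, k <= n.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```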
Reasoning with Exploration: An Entropy Perspective
text, post-training
idea. tokens crucial for reasoning (wait, hold on, what if) are high-entropy. so include entropy in the objective to reinforce them.
implementation. introduce an entropy term in the GRPO objective, clipped and scaled so it can't dominate the advantage. this change boosts Pass@K (even at large K).
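a sketch of how I picture the shaping (my paraphrase; the paper's exact clipping and scaling may differ):

```python
import torch

def entropy_shaped_advantage(adv, logits, alpha=0.1):
    # add a per-token entropy bonus to the advantage, clipped so it can never
    # flip the advantage's sign, and detached so it steers credit assignment
    # rather than directly optimizing entropy.
    # adv: (T,) per-token advantages; logits: (T, V) policy logits.
    logp = torch.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(-1)                         # (T,)
    bonus = torch.minimum(alpha * entropy, adv.abs() / 2).detach()
    return adv + bonus
```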
Dense SAE Latents are Features, not Bugs
text, mech interp
idea. many sparse autoencoder latents fire on 10-50 percent of tokens: they are dense and encode multiple layers of meaning. this is intrinsic, not noise.
implementation. (lots of mech interp machinery I didn't truly parse.) many of these dense latents have a real purpose: positional trackers, context-binding pairs, null-space latents, alphabet-prefix logit boosters, part-of-speech detectors, and principal-component reconstructions. some are still unexplained.
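for reference, "dense" here is just firing frequency, which is cheap to measure:

```python
import torch

def latent_density(sae_acts: torch.Tensor) -> torch.Tensor:
    # fraction of tokens on which each SAE latent fires (activation > 0).
    # sae_acts: (n_tokens, n_latents) post-ReLU latent activations.
    # dense latents in the paper's sense land around 0.10-0.50 here.
    return (sae_acts > 0).float().mean(dim=0)
```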
comments. mech interp is hard to make causal.
Leaky Thoughts: Large Reasoning Models are Not Private Thinkers
text, privacy
idea. reasoning models leak private information from their prompts into their chains of thought, even when instructed to keep it out; the reasoning trace gets treated as a private scratchpad, but it isn't one. more generally, language models are not very good at following system prompts.
thoughts
RL still seems to be the hot topic; better metrics for RLVR, LLM-as-a-judge, new objectives (with entropy added in or a different expression to clip). the learned tokenizer idea is super promising - there's a trend that letting the model learn what works best for itself will (eventually) outperform frozen methods, but there's more work to be done here (different corpus? more layers?). diffusion/multimodal remains entrenched in mystery, at least for me. but overall, solid findings (and breakneck pace)!