early july paper reading
From Bytes to Ideas: Language Modeling with Autoregressive U-Nets
text, pre-training
idea. transformers rely on fixed tokenizers (BPE, SentencePiece) to chop inputs into tokens. instead, feed the model raw bytes and let it learn its own multi-scale embeddings as it trains. let the model decide how to carve up its input.
implementation. a decoder-only transformer, but it takes in raw bytes. the U-shape (the name comes from the U-Net used for image segmentation) contracts bytes into words, then words into two-word chunks, feeds those through eighteen attention layers, and expands them back to byte length. cross-entropy is on the next byte rather than the next token.
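the contract/process/expand shape is easier to see in code. a toy one-stage sketch (mine, not the paper's: pooling at spaces stands in for the learned splitting function, and the deeper multi-stage hierarchy plus causal masking are omitted):

```python
import torch
import torch.nn as nn

class ToyAUNet(nn.Module):
    # toy one-stage contract / process / expand over raw bytes.
    # causal masking is omitted for brevity, so this version leaks future bytes.
    def __init__(self, d=64, n_heads=4, n_layers=2):
        super().__init__()
        self.byte_emb = nn.Embedding(256, d)
        layer = nn.TransformerEncoderLayer(d, n_heads, 4 * d, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, n_layers)  # runs at word scale
        self.out = nn.Linear(d, 256)  # next-byte logits

    def forward(self, byte_ids):  # byte_ids: (1, T) long tensor of byte values
        x = self.byte_emb(byte_ids)                         # (1, T, d)
        # contract: keep the hidden state at each word boundary (space = byte 32)
        bounds = (byte_ids[0] == 32).nonzero().squeeze(-1)
        words = self.trunk(x[:, bounds, :])                 # (1, W, d)
        # expand: broadcast each word vector back over its byte span
        up = torch.zeros_like(x)
        prev = 0
        for i, b in enumerate(bounds.tolist()):
            up[:, prev:b + 1, :] = words[:, i, :]
            prev = b + 1
        return self.out(x + up)                             # (1, T, 256)

logits = ToyAUNet()(torch.tensor([list(b"bytes to ideas ")]))
```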
comments. when trained on the same DCLM data, the three-stage AU-Net matches or slightly beats a BPE transformer on language tasks and wins on character-centric tests, but loses on GSM8K and TriviaQA. this makes sense given the trade-offs of a learned embedding versus a fixed one: numbers and entities are spelled out byte-by-byte, so the AU-Net needs more compute than a BPE model to learn them.
LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs
text, pre-training
idea. context-length extension tricks can work on diffusion LLMs too. this paper extends context length 6x, training-free and with accuracy preserved, using an NTK-style rescaling (neural tangent kernel) of RoPE.
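the standard training-free NTK-aware trick rescales the rotary base; a sketch (whether LongLLaDA uses exactly this formula is my assumption):

```python
import torch

def ntk_rope_inv_freq(dim: int, scale: float = 6.0, base: float = 10000.0) -> torch.Tensor:
    # NTK-aware RoPE extension: raise the rotary base so low frequencies
    # stretch to cover a scale-times-longer context while high frequencies
    # barely move. no retraining involved.
    new_base = base * scale ** (dim / (dim - 2))
    return 1.0 / new_base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim)

inv_freq = ntk_rope_inv_freq(128, scale=6.0)  # plug into the rotary embedding
```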
comments. need to read up more on diffusion LLMs, neural tangent kernel, and context-length extension tricks (seen this in YaRN, Arcee's latest model, and here).
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
text, pre-training, post-training
implementation. MiniMax-M1 is an MoE with lightning attention, a linear-time attention variant; a standard softmax-attention block follows every seven lightning blocks. this allows huge context lengths and long CoTs while saving FLOPs, and the model stays competitive with other open-weight models. also note that the RL run cost about half a million dollars, and they introduce a new RL objective, CISPO, which clips the importance-sampling weights themselves (with a stop-gradient) rather than clipping token updates the way PPO's surrogate does.
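my reading of the CISPO objective, as a sketch (not their code; the GRPO-style scalar advantage and the epsilon values are assumptions):

```python
import torch

def cispo_loss(logp_new, logp_old, adv, eps_low=0.1, eps_high=0.1):
    # CISPO: clip the importance-sampling weight and stop-gradient it, so every
    # token keeps a REINFORCE-style gradient; PPO's clipped surrogate would
    # zero out gradients for clipped tokens instead.
    # logp_new, logp_old: (T,) per-token log-probs; adv: scalar trajectory advantage.
    ratio = torch.exp(logp_new - logp_old.detach())
    weight = torch.clamp(ratio, 1 - eps_low, 1 + eps_high).detach()  # sg(clip(r))
    return -(weight * adv * logp_new).mean()
```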
RLVR Implicitly Incentivizes Correct Reasoning in LLMs
text, post-training
idea. this paper refutes the claim that RLVR doesn't create new reasoning ability and only biases the model toward already-known solutions, which a previous paper argued using Pass@K. the counter-argument: Pass@K isn't a great proxy for reasoning quality - think of lucky answers reached through wrong processes.
implementation. this paper uses an LLM judge and a new metric, CoT-Pass@K, which grades the chain of thought as well as the final answer. RLVR does well on this metric and doesn't show the plateau that Pass@K does.
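the estimator itself is just the standard unbiased pass@k with a stricter notion of "correct". a sketch (the LLM-judge call that produces c is not shown):

```python
from math import comb

def cot_pass_at_k(n: int, c: int, k: int) -> float:
    # unbiased pass@k estimator (Chen et al., 2021), reused for CoT-Pass@K:
    # n = samples drawn, c = samples whose final answer AND chain of thought
    # were both judged correct, k <= n.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```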
Reasoning with Exploration: An Entropy Perspective
text, post-training
idea. tokens crucial for reasoning (wait, hold on, what if) are high-entropy. so include entropy in the objective to reinforce them.
implementation. introduce an entropy term in the GRPO objective, clipped and scaled so it can't dominate the advantage. this change boosts Pass@K (even at large K).
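a sketch of how I picture the shaping (my paraphrase; the paper's exact clipping and scaling may differ):

```python
import torch

def entropy_shaped_advantage(adv, logits, alpha=0.1):
    # add a per-token entropy bonus to the advantage, clipped so it can never
    # flip the advantage's sign, and detached so it steers credit assignment
    # rather than directly optimizing entropy.
    # adv: (T,) per-token advantages; logits: (T, V) policy logits.
    logp = torch.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(-1)                         # (T,)
    bonus = torch.minimum(alpha * entropy, adv.abs() / 2).detach()
    return adv + bonus
```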
Dense SAE Latents are Features, not Bugs
text, mech interp
idea. many sparse autoencoder latents fire on 10-50 percent of tokens: they are dense and encode multiple layers of meaning. this is intrinsic, not noise.
implementation. (lots of mech interp machinery I didn't truly parse.) many of these dense latents have a real purpose: positional trackers, context-binding pairs, null-space latents, alphabet-prefix logit boosters, part-of-speech detectors, and principal-component reconstructions. some are still unexplained.
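for reference, "dense" here is just firing frequency, which is cheap to measure:

```python
import torch

def latent_density(sae_acts: torch.Tensor) -> torch.Tensor:
    # fraction of tokens on which each SAE latent fires (activation > 0).
    # sae_acts: (n_tokens, n_latents) post-ReLU latent activations.
    # dense latents in the paper's sense land around 0.10-0.50 here.
    return (sae_acts > 0).float().mean(dim=0)
```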
comments. mech interp is hard to make causal.
Leaky Thoughts: Large Reasoning Models are Not Private Thinkers
text, privacy
idea. reasoning models leak private information from their prompts into their chains of thought, even when instructed to keep it out; the reasoning trace gets treated as a private scratchpad, but it isn't one. more generally, language models are not very good at following system prompts.
thoughts
RL still seems to be the hot topic; better metrics for RLVR, LLM-as-a-judge, new objectives (with entropy added in or a different expression to clip). the learned tokenizer idea is super promising - there's a trend that letting the model learn what works best for itself will (eventually) outperform frozen methods, but there's more work to be done here (different corpus? more layers?). diffusion/multimodal remains entrenched in mystery, at least for me. but overall, solid findings (and breakneck pace)!