mid july paper reading

given the influx of new papers, I've decided to focus on papers that deal with transformers + text. that being said, I'm sure there's really cool research being done outside of that realm as well, so I'll include a list of other interesting papers at the end.

this post will resemble my notes, so expect slightly less coherent formatting.

TL;DR

tired: cot analysis

wired: RL, RL environments, LLM-as-a-judge RL, benchmarks that evaluate self-improvement

Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning

this was from a while ago but still interesting because I just know Seed is waiting to release some crazy good models. Seed 1.5 gaps DeepSeek-R1 (the original, not the 05-28 update; this was back in April) on a variety of benchmarks.

what exactly makes this model so good? great STEM data (clean and high quality) and non-STEM data (high variance, creative writing and dialogue), SFT on CoT to get reasoning, RLVR with judges, open-ended RL with a pairwise generative reward model, and PPO with a clip-higher variant (VAPO when there's a critic, DAPO when there isn't) for stability. cracked.
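
for reference, here's a minimal sketch of what a "clip-higher" PPO surrogate looks like, as I understand it from DAPO: the upper clip bound is loosened relative to the lower one so that low-probability tokens can still be up-weighted. the epsilon values and function shape below are illustrative assumptions, not the paper's exact settings.

```python
# Sketch of a clip-higher PPO objective (DAPO-style asymmetric clipping).
# eps_low / eps_high are illustrative, not the paper's exact hyperparameters.
import torch

def clip_higher_ppo_loss(logp_new, logp_old, advantages,
                         eps_low=0.2, eps_high=0.28):
    ratio = torch.exp(logp_new - logp_old)            # per-token importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * advantages
    return -torch.min(unclipped, clipped).mean()      # maximize the surrogate
```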

The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements

cool idea for a benchmark! given a training script for nanogpt, see if an LLM can make it faster, either through algorithmic improvements or hardware optimizations. the model is tested with and without extra hints (like pseudocode or textual description). o3-mini does the best. I want to see o3-pro do it though.

Generalist Reward Models: Found inside Large Language Models

theoretical proof that yes, LLM-as-a-judge works.

old method (LLM-as-a-judge): generate answer(s) from LLM with prompt. give a system prompt to judge LLM and score the answer(s)
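
for contrast, the old setup is really just prompt engineering. the prompt wording and the 1-10 scale below are my own placeholders, not anything from the paper.

```python
# Sketch of classic LLM-as-a-judge: the judge only sees a scoring prompt.
JUDGE_SYSTEM_PROMPT = (
    "You are a strict grader. Given a question and a candidate answer, "
    "reply with a single integer score from 1 (poor) to 10 (excellent)."
)

def judge_messages(question: str, answer: str) -> list[dict]:
    # messages you would pass to any chat-completion style API
    return [
        {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
        {"role": "user", "content": f"Question:\n{question}\n\nAnswer:\n{answer}"},
    ]
```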

the new method (endogenous reward): generate answer(s) from the LLM with a prompt. feed prompt + answer into the judge LLM. directly calculate the answer's reward from the tokens and logits (using an inverse soft Bellman operator, see paper). a bit more complicated and mathematical (anything is more complicated than a simple system prompt), but it outperforms LLM-as-a-judge by a marked amount (10-20 percent), which is definitely worth it.
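
here's a rough sketch of the idea as I read it: treat the judge LM's logits as soft Q-values and recover a per-token reward via an inverse soft Bellman operator, roughly r_t = Q(s_t, a_t) - gamma * logsumexp_a Q(s_{t+1}, a). gamma = 1, the exact normalization, and the Hugging Face-style model interface (a causal LM returning .logits) are my assumptions; see the paper for the precise operator.

```python
# Sketch of endogenous reward: logits as soft Q-values, reward from the
# inverse soft Bellman operator. gamma and normalization are assumptions.
import torch

@torch.no_grad()
def endogenous_rewards(model, input_ids, gamma=1.0):
    # input_ids: (1, T) prompt + answer tokens fed to the judge LM
    logits = model(input_ids).logits                       # (1, T, vocab) soft Q-values
    q_taken = logits[:, :-1, :].gather(                    # Q(s_t, a_t) for each emitted token
        -1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    v_next = torch.logsumexp(logits[:, 1:, :], dim=-1)     # soft value of the next state
    return q_taken - gamma * v_next                        # per-token endogenous reward
```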

The Trilemma of Truth in Large Language Models

training an SVM to see if the LLM thinks a statement is (1) true, (2) false, or (3) neither. I thought the three-way split was rather arbitrary (not canonical enough to justify calling it a trilemma in the title), but it's a pragmatic choice. there are many statements without a truth value (questions, orders, or things the language model doesn't know).

also, the SVM neglects the order of the tokens in the sentences, which seems crucial. for example, 'San Francisco is in California' and 'California is in San Francisco' are treated identically. this is noted as an issue for future work, which is respectable.
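
a rough sketch of what this kind of probe looks like: mean-pool hidden states per statement and fit a 3-class SVM (true / false / neither). mean pooling is exactly the order-insensitive choice I'm complaining about above. the layer choice, pooling, and HF-style model/tokenizer interface are my assumptions, not necessarily the paper's exact setup.

```python
# Sketch of a 3-way truth probe over LM hidden states (true/false/neither).
import numpy as np
from sklearn.svm import SVC

def pooled_features(model, tokenizer, statements, layer=-1):
    feats = []
    for s in statements:
        ids = tokenizer(s, return_tensors="pt")
        hidden = model(**ids, output_hidden_states=True).hidden_states[layer]
        feats.append(hidden.mean(dim=1).squeeze(0).detach().numpy())  # token order lost here
    return np.stack(feats)

# labels: 0 = true, 1 = false, 2 = neither
probe = SVC(kernel="linear")
# probe.fit(pooled_features(model, tokenizer, train_statements), train_labels)
```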

SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

AGI through playing games.

training qwen3-4b on games like Kuhn poker, tic-tac-toe, and simple negotiation enhances performance on math benchmarks.

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

SFT and RL on math improve scores on math reasoning benchmarks, RL more than SFT. SFT and RL on math improve scores on other reasoning benchmarks, roughly matching in effectiveness. SFT on math decreases scores on non-reasoning benchmarks, while RL on math preserves (or even slightly improves) them.

RL is cracked. No wonder xAI spent half their budget on Grok 4 RL.

Fast and Simplex: 2-Simplicial Attention in Triton

attention is pairwise, 2-simplicial attention is triplet-wise. the team implements 2-simplicial attention with sliding windows to avoid the nasty O(n³) cost. a model where one in four layers uses 2-simplicial heads beats a same-sized transformer on benchmarks with a fixed token budget.
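
a naive, un-windowed sketch of the triplet-wise computation, just to show the shape of it: attention logits live over pairs of key positions, so this is O(n³) in sequence length. the scaling factor, the second key/value projections, and the elementwise value combination below are my assumptions about the general form; the real implementation uses sliding windows and a Triton kernel.

```python
# Naive single-head 2-simplicial (trilinear) attention sketch, O(n^3).
import torch

def two_simplicial_attention(q, k1, k2, v1, v2):
    # q, k1, k2, v1, v2: (n, d) for a single head
    n, d = q.shape
    logits = torch.einsum("id,jd,kd->ijk", q, k1, k2) / d        # trilinear scores
    attn = torch.softmax(logits.reshape(n, -1), dim=-1).reshape(n, n, n)
    values = torch.einsum("jd,kd->jkd", v1, v2)                  # pairwise value combination
    return torch.einsum("ijk,jkd->id", attn, values)             # (n, d) output
```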

in principle, n-simplicial attention would capture every possible higher-order relation, but that would explode today's workloads. 2-simplicial is a sweet spot (and even that might be a bit much compute today, actually).

Other Papers

I found "Test-Time Scaling with Reflective Generative Model" confusing. Not super well written (grammatical mistakes, typos, overall lack of clarity). Regarding "Chain-of-Thought is Not Explainability," I feel like anyone who has seen R1's multilingual multisymbolic chain of thought for doing a math problem sees that CoT isn't necessarily coherent, and if even it is, it doesn't really work like a window into the LM's inner workings.