mid july paper reading
given the influx of new papers, I've decided to focus on papers that deal with transformers + text. that being said, I'm sure there's really cool research being done outside of that realm as well, so I'll include a list of other interesting papers at the end.
this post will resemble my notes, so expect slightly less coherent formatting.
TL;DR
tired: cot analysis
wired: RL, RL environments, LLM-as-a-judge RL, benchmarks that evaluate self-improvement
Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning
this was from a while ago but still interesting because I just know Seed is waiting to release some crazy good models. Seed 1.5 gaps DeepSeek-R1 (not the 05-28 refresh; this was back in April) on a variety of benchmarks.
what exactly makes this model so good? great STEM data (clean and high quality) and non-STEM data (high variance: creative writing and dialogue), SFT on CoT to bootstrap reasoning, RLVR with judges, open-ended RL with a pairwise generative reward model, and PPO with a clip-higher variant (VAPO for actor-critic, DAPO for no critic) for stability. cracked.
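the clip-higher trick is easiest to see as a one-line change to PPO's clipped surrogate: decouple the clip range so the upper bound is looser than the lower one. a minimal sketch (function name and the ε values are mine, not the paper's):

```python
import torch

def clip_higher_policy_loss(logp_new, logp_old, advantages,
                            eps_low=0.2, eps_high=0.28):
    # per-token importance ratio r_t = pi_new(a|s) / pi_old(a|s)
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    # asymmetric clip: eps_high > eps_low lets low-probability tokens
    # grow more before the clip kicks in (the "clip-higher" part)
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    # standard PPO pessimism: take the minimum of the two surrogates
    return -torch.min(unclipped, clipped).mean()
```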
The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements
cool idea for a benchmark! given a training script for nanogpt, see if an LLM can make it faster, either through algorithmic improvements or hardware optimizations. the model is tested with and without extra hints (like pseudocode or textual description). o3-mini does the best. I want to see o3-pro do it though.
Generalist Reward Models: Found inside Large Language Models
theoretical proof that yes, LLM-as-a-judge works.
the old method (LLM-as-a-judge): generate answer(s) from the LLM given a prompt, then hand a judge LLM a system prompt and have it score the answer(s).
the new method (endogenous reward): generate answer(s) from the LLM given a prompt, feed prompt + answer into the judge LLM, and calculate the answer's reward directly from the tokens and logits (using an inverse soft Bellman operator, see the paper). a bit more complicated and mathematical (anything is more complicated than a simple system prompt), but it outperforms LLM-as-a-judge by a marked amount (10-20 percent), which is definitely worth it.
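roughly, the judge LM's logits over the answer tokens play the role of soft Q-values, and the reward falls out of one inverse soft Bellman step per token. a hedged sketch of the idea (indexing and terminal handling simplified; not the paper's exact formulation):

```python
import torch

@torch.no_grad()
def endogenous_reward(model, tokenizer, prompt, answer, gamma=1.0):
    # score prompt + answer with the judge LM; logits act as soft Q-values
    ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    n_prompt = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    logits = model(ids).logits                              # [1, T, vocab]
    # Q(s_t, a_t): logit of the token that was actually emitted
    q_chosen = logits[0, :-1].gather(-1, ids[0, 1:, None]).squeeze(-1)
    # V(s_{t+1}) = logsumexp over next-step logits (soft value)
    v_next = torch.logsumexp(logits[0, 1:], dim=-1)
    # inverse soft Bellman step: r_t = Q(s_t, a_t) - gamma * V(s_{t+1})
    r = q_chosen - gamma * v_next
    return r[n_prompt - 1:].sum().item()   # sum rewards over answer tokens
```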
The Trilemma of Truth in Large Language Models
training an SVM to see whether an LLM thinks a statement is (1) true, (2) false, or (3) neither. I thought the three-way split was rather arbitrary (not canonical enough to justify calling it a trilemma in the title), but it's a pragmatic choice: plenty of statements have no truth value (questions, commands, or things the language model simply doesn't know).
also, the SVM ignores the order of the tokens in a sentence, which seems crucial: 'San Francisco is in California' and 'California is in San Francisco' are treated identically. the authors note this as an issue for future work, which is respectable.
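a minimal sketch of what an order-blind probe like this might look like (mean-pooled hidden states and a linear SVC are my stand-ins; the paper's actual setup is more involved):

```python
import torch
from sklearn.svm import SVC

@torch.no_grad()
def statement_features(model, tokenizer, statements, layer=-1):
    # mean-pool a frozen LM's hidden states over each statement's tokens;
    # this pooling step is exactly what discards token order
    feats = []
    for s in statements:
        ids = tokenizer(s, return_tensors="pt")
        hs = model(**ids, output_hidden_states=True).hidden_states[layer]
        feats.append(hs.mean(dim=1).squeeze(0))
    return torch.stack(feats).numpy()

# three-way probe: 0 = true, 1 = false, 2 = neither
# X = statement_features(model, tokenizer, train_statements)
# probe = SVC(kernel="linear").fit(X, train_labels)
```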
SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning
AGI through playing games.
training qwen3-4b on games like Kuhn poker, tic-tac-toe, and simple negotiation enhances performance on math benchmarks.
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
SFT and RL on math both improve scores on math reasoning benchmarks, RL more than SFT. both also improve scores on other reasoning benchmarks, roughly matching each other in effectiveness. on non-reasoning benchmarks they diverge: SFT on math decreases scores, while RL on math holds them steady or even improves them.
RL is cracked. no wonder xAI spent half the Grok 4 budget on RL.
NaturalThoughts: Selecting and Distilling Reasoning Traces for General Reasoning Tasks
good data (NaturalThoughts) leads to good benchmark scores!
Fast and Simplex: 2-Simplicial Attention in Triton
attention is pairwise; 2-simplicial attention is triplet-wise. the team implements 2-simplicial attention with sliding windows to avoid the nasty O(n³). a model where one in four layers uses 2-simplicial heads beats a same-sized transformer on benchmarks at a fixed token budget.
in theory, n-simplicial attention is the best possible way to capture every higher-order relation, but the compute would explode on today's workloads. 2-simplicial is a sweet spot (and honestly might already be a bit much compute today).
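the core change is the scoring function: instead of a dot product between a query and one key, you take a trilinear product between a query and two keys, so each query attends over pairs of positions. a naive sketch that makes the O(n³) obvious (no masking, no sliding window; the projection names are mine):

```python
import torch

def two_simplicial_attention(q, k1, k2, v1, v2):
    # all inputs are [seq, dim]; this is the brute-force O(n^3) version
    n, d = q.shape
    # trilinear score: A[i, j, k] = sum_d q[i,d] * k1[j,d] * k2[k,d]
    scores = torch.einsum("id,jd,kd->ijk", q, k1, k2) / d ** 0.5
    # softmax jointly over all (j, k) pairs for each query i
    attn = torch.softmax(scores.reshape(n, -1), dim=-1).reshape(n, n, n)
    # value for a pair (j, k): elementwise product of the two value vectors
    pair_values = torch.einsum("jd,kd->jkd", v1, v2)
    return torch.einsum("ijk,jkd->id", attn, pair_values)
```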
Other Papers
I found "Test-Time Scaling with Reflective Generative Model" confusing. Not super well written (grammatical mistakes, typos, overall lack of clarity). Regarding "Chain-of-Thought is Not Explainability," I feel like anyone who has seen R1's multilingual multisymbolic chain of thought for doing a math problem sees that CoT isn't necessarily coherent, and if even it is, it doesn't really work like a window into the LM's inner workings.
Shortlist
xLSTMAD: A Powerful xLSTM-based Method for Anomaly Detection
Survey on Evaluation of LLM-based Agents
Deep Research Agents: A Systematic Examination and Roadmap
AI4Research: A Survey of Artificial Intelligence for Scientific Research
UMA: A Family of Universal Models for Atoms
Small Language Models are the Future of Agentic AI
Transition Matching: Scalable and Flexible Generative Modeling
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation
RoboScape: Physics-informed Embodied World Model
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers
Unified Multimodal Understanding via Byte-Pair Visual Encoding