early august paper reading
TL;DR: kimi k2 has cool techniques, reasoning length doesn't necessarily boost accuracy, system prompts and iterations are powerful, we can treat low-entropy and high-entropy tokens differently
kimi k2 technical report
Kimi K2 is a 1T-parameter (32B active) Mixture-of-Experts model trained on 15.5 trillion tokens, notably without loss spikes. It uses an architecture akin to DeepSeek-V3, but with more experts (384 vs. 256) and fewer attention heads (64 vs. 128). Below are some details I thought were worth noting about K2.
MuonClip: The Moonshot team has previously shown that Muon outperforms AdamW when compute- and hyperparameter-matched. The key challenge with Muon, however, is that exploding attention logits lead to training instability. To scale up Muon, the team proposes QK-Clip, which rescales the query and key weights of particular heads so that their attention logits do not explode. The team refers to the combination of Muon and QK-Clip as MuonClip, the optimizer behind the beautiful loss graph in both the smaller experiments and the full training run for K2. Admittedly, I know less about optimizers than I ought to, and I might look into how Muon and AdamW work as part of a future blogpost.
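To make QK-Clip concrete, here's a rough sketch of the rescaling step in isolation, assuming you already track the max attention logit per head; the real MuonClip folds this into the Muon update, and the threshold below is just a placeholder:

```python
# Rough sketch of QK-Clip; not the actual MuonClip implementation.
import torch

def qk_clip_(W_q: torch.Tensor, W_k: torch.Tensor,
             max_logits: torch.Tensor, tau: float = 100.0) -> None:
    """In-place rescale of per-head query/key weights after an optimizer step.

    W_q, W_k:   [num_heads, head_dim, hidden_dim] per-head projection weights
    max_logits: [num_heads] max attention logit observed for each head
    tau:        placeholder clipping threshold
    """
    for h in range(W_q.shape[0]):
        if max_logits[h] > tau:
            # The logit is bilinear in W_q and W_k, so splitting the factor
            # sqrt(tau / max_logit) across both shrinks it to roughly tau.
            gamma = (tau / max_logits[h]) ** 0.5
            W_q[h].mul_(gamma)
            W_k[h].mul_(gamma)
```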
Data Rephrasing: To maximize the usefulness of high-quality data, the team generates synthetic rephrased versions using LLMs, specifically ensuring that these rephrasings have diverse styles and perspectives. Training on rephrasings is intuitively better than training on the same excerpt multiple times, as the latter can lead to overfitting, and it is also empirically superior, as shown with a small-scale experiment in the tech report. Note: this technique was also present in previous research, notably in this 2024 paper.
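A minimal sketch of what a rephrasing pass could look like; `call_llm` and the style list are my placeholders, not the actual K2 pipeline:

```python
# Illustrative only; `call_llm` is a placeholder generation function and the
# styles/prompt are mine, not the K2 report's.
STYLES = ["an encyclopedia entry", "a casual blog post",
          "lecture notes", "a Q&A dialogue"]

def rephrase(passage: str, call_llm, n: int = 4) -> list[str]:
    prompts = [
        f"Rewrite the following passage as {STYLES[i % len(STYLES)]}, "
        f"preserving every fact:\n\n{passage}"
        for i in range(n)
    ]
    # Train on these rephrasings instead of repeating the original for n epochs.
    return [call_llm(p) for p in prompts]
```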
Sparsity Scaling Law: To determine optimal hyperparameters, the team develops a scaling law for sparsity := the ratio of total experts to the number of activated experts. (MoE is also a potential direction for a future post!) Fixing the number of activated experts to 8 and varying the total number of experts, the team finds that sparser MoEs achieve the same validation loss with less compute. The tradeoff with sparsity, as stated in the paper, is really between model performance and infrastructure complexity on the GPU side. More total experts means more work allocating them across GPUs, storing, checkpointing, and restoring the extra parameters, and it makes throughput a lot harder to optimize. In fact, the team even writes that "the remaining GPU memory on each device is insufficient to hold the full MoE activations," making their choice of 384 experts over some larger number very reasonable.
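For intuition, the sparsity arithmetic is just a ratio; the numbers below assume K2's reported 384 total / 8 activated experts per token, ignoring the shared expert:

```python
# Back-of-the-envelope sparsity numbers (shared expert ignored).
total_experts = 384
active_experts = 8
sparsity = total_experts / active_experts
print(f"K2 sparsity: {sparsity:.0f}")                  # 48
print(f"DeepSeek-V3-style reference: {256 / 8:.0f}")   # 32
```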
Self-Critique Rubric Reward: Along with ground-truth reward signals (RLVR), the team uses a Self-Critique Reward mechanism where the model evaluates its own outputs. The team trains a K2 critic using three types of rubrics: core rubrics, prescriptive rubrics, and human-annotated rubrics. Core rubrics outline K2's values: clarity and relevance of the answer, conversational fluency and engagement with the topic, and an objective and grounded tone (no meta-commentary). Prescriptive rubrics try to stamp out reward hacking by penalizing sycophantic compliments and explicit justifications of why its own response is good. While effective, these rubrics also favor direct, decisive answers over rightfully impartial or cautious ones.
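Roughly, I picture the reward routing like the sketch below; the rubric texts and `critic_score` are my placeholders, and the actual K2 critic likely judges multiple rollouts against each other rather than scoring one in isolation, so treat this as illustration only:

```python
# Illustrative routing only; rubric texts and `critic_score` are placeholders.
CORE_RUBRICS = [
    "clarity and relevance of the answer",
    "conversational fluency and engagement with the topic",
    "objective, grounded tone without meta-commentary",
]
PRESCRIPTIVE_RUBRICS = [
    "no sycophantic compliments",
    "no explicit justification of why the response is good",
]

def reward(prompt: str, response: str, ground_truth=None, critic_score=None) -> float:
    if ground_truth is not None:
        # Verifiable task (math, code): RLVR-style binary reward.
        return float(response.strip() == ground_truth)
    # Open-ended task: the critic judges the response against the rubrics.
    return critic_score(prompt, response,
                        rubrics=CORE_RUBRICS + PRESCRIPTIVE_RUBRICS)
```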
Like Qwen, the Kimi team shows that curating great data, generating quality synthetic data from it, recreating (if not innovating upon) what worked for other labs, and solid fundamentals across the board (pretraining, data, infra, post-training) can get you to SOTA. Actually incredible work.
inverse scaling in ttc
Observation: Across models (Opus 4, o3, R1), the team observes an inverse relationship between test-time compute (reasoning length) and accuracy, at least on four different types of tasks. This happens both when reasoning length is explicitly controlled (prompting 'think harder' or 'don't think') and when the model chooses its own reasoning length.
This inverse scaling law is seen most clearly with the Misleading Math dataset (math problems with distractors) and Zebra Puzzles (logic puzzles), and shows up less consistently with the other types of problems (Misleading Python, Misleading Math Famous Paradoxes, Grades Regression). Sometimes the trend appears when models reason naturally but not when reasoning length is controlled (Zebra Puzzles); other times it is seen in both modes (Opus 4 on Misleading Python), or in neither (o3 on Misleading Math Famous Paradoxes).
I think the inconsistency of this scaling law is promising. One of the biggest appeals of test-time compute is that it allows models to backtrack, rethink, and re-evaluate before committing to a final answer. This is the behavior that counters LeCun's argument (if an LLM has probability p of being wrong at each step, the probability of a fully correct generation decays exponentially with length). Hopefully, there is some new method that allows the model to think, but not overthink.
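To put numbers on the exponential-decay argument (the error rate here is illustrative, not from the paper):

```python
# Illustrative per-step error rate; not a number from any paper.
p = 0.01
for t in (100, 1000, 10000):
    print(t, (1 - p) ** t)   # ~0.37, ~4e-5, ~2e-44
```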
reasoning or memorization
Context: Previous papers reveal that Qwen, even with incorrect or random rewards during the RL stage, sees improved performance on math benchmarks. The same is not true for Llama.
Claim: Qwen has likely seen the problems on these benchmarks before in pretraining, while Llama hasn't.
Methodology: Build a clean synthetic dataset of math problems (one neither model could have memorized). Show that only with accurate reward signals do Qwen and Llama exceed their performance ceilings.
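Something like the following reward variants is what that comparison boils down to; the function names and answer extraction are placeholders, not the paper's code:

```python
# Placeholder reward functions for the accurate / random / incorrect comparison.
import random

def extract_final_answer(response: str) -> str:
    # Toy extraction: whatever follows the last "Answer:" marker.
    return response.rsplit("Answer:", 1)[-1].strip()

def accurate_reward(response: str, answer: str) -> float:
    return float(extract_final_answer(response) == answer)

def random_reward(response: str, answer: str) -> float:
    return float(random.random() < 0.5)             # ignores correctness entirely

def incorrect_reward(response: str, answer: str) -> float:
    return 1.0 - accurate_reward(response, answer)  # rewards wrong answers
```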
gemini 2.5 capable of winning gold at imo
This paper reproduces GDM's IMO Gold feat, showing that Gemini 2.5 Pro can solve 5/6 problems with nothing more than system prompts and a verification loop.
Step 1: The Student model generates N solutions. The team observes that these initial attempts tend to use up the entire thinking budget (32768 tokens) and are of lesser quality, as expected.
Step 2: Self-improvement. The Student is prompted to improve its work. One concrete reason why this step helps is that it effectively doubles the thinking budget, a bottleneck in Step 1.
Step 3: A Verifier model critiques the solutions and writes a report enumerating errors, categorizing them into critical errors (demonstrably false statements and logical fallacies) and justification gaps (major or minor).
Step 4: The Verifier is prompted to improve its report.
Step 5: The Student is given the report and improves upon its solution.
Steps 3 to 5 are repeated until the Student and Verifier converge on a valid solution.
The team uses specific system prompts for the Student and the Verifier. The Student prompt emphasizes rigor above all else and honesty (don't make things up), and encourages presenting intermediate results. The Verifier prompt emphasizes rigor and spells out the distinction between critical errors and justification gaps.
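Here's a compressed sketch of the whole loop; `student` and `verifier` stand in for Gemini 2.5 Pro calls carrying the respective system prompts, Step 1's N parallel attempts are collapsed to one, and the prompts and acceptance check are heavily simplified:

```python
# `student` and `verifier` are placeholders for model calls with the paper's
# respective system prompts; everything else here is a simplification.
def solve(problem: str, student, verifier, max_rounds: int = 10):
    solution = student(f"Solve rigorously:\n\n{problem}")                      # Step 1
    solution = student(f"Review and improve your solution:\n\n{solution}")     # Step 2
    for _ in range(max_rounds):
        report = verifier(                                                     # Step 3
            f"List every critical error and justification gap:\n\n{solution}")
        report = verifier(f"Re-examine and improve your report:\n\n{report}")  # Step 4
        if "no critical errors" in report.lower():  # simplified convergence check
            return solution
        solution = student(                                                    # Step 5
            f"Problem:\n{problem}\n\nYour solution:\n{solution}\n\n"
            f"Verifier report:\n{report}\n\nRevise your solution accordingly.")
    return None
```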
One little detail in the paper is that the team explicitly prompted the model to use induction for P1 and algebraic geometry for P2. Some might argue that these hints are unfair, but the team argues that they just lower Gemini's computation time. I think the more math competitions you do, the more valid the team's reasoning is. Induction and algebraic geometry are broad enough approaches that in a competition scenario, you're bound to try them eventually if they were not your first guess already. Ideally, Gemini would solve these without any assistance, but the researchers don't have lab/institutional backing for that sweet compute, so I empathize.
dual-token constraints for RLVR
Idea (a really nice one): There are times when the model should explore more and times when the model should stick to what it knows. Specifically, the model should explore when it reasons but be consistent when recalling factual knowledge.
Implementation: Calculate the entropy of every token in the response. If a token's entropy is above the 80th percentile, classify it as a high-entropy reasoning token and use higher clip thresholds and weaker KL regularization (allowing exploration). Otherwise, classify it as a low-entropy knowledge token and use lower clip thresholds and stronger KL regularization (preserving facts).
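A sketch of that token split in isolation, assuming per-token entropies, importance ratios, advantages, and KL terms are already computed; the actual Archer thresholds and loss wiring sit inside a GRPO-style objective and will differ:

```python
# Illustrative thresholds; the real Archer values and loss wiring differ.
import torch

def dual_token_loss(ratio, entropy, advantage, kl,
                    eps_knowledge=0.2, eps_reasoning=0.5,
                    kl_knowledge=0.1, kl_reasoning=0.0):
    """ratio, entropy, advantage, kl: [batch, seq_len] tensors."""
    # High-entropy "reasoning" tokens: top 20% of tokens by entropy.
    is_reasoning = entropy >= torch.quantile(entropy, 0.8)

    # Looser clip range and weaker KL for reasoning tokens; tighter clip and
    # stronger KL for low-entropy knowledge tokens.
    eps = torch.full_like(entropy, eps_knowledge)
    eps[is_reasoning] = eps_reasoning
    kl_coef = torch.full_like(entropy, kl_knowledge)
    kl_coef[is_reasoning] = kl_reasoning

    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    policy_loss = -torch.minimum(ratio * advantage, clipped * advantage)
    return (policy_loss + kl_coef * kl).mean()
```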
The model, Archer, achieves SOTA for its 1.5B size on math and coding benchmarks, which is really nice.
shortlist:
- Mixture-of-Recursions
- Scaling Laws for Optimal Data Mixtures
- Training Transformers with Enforced Lipschitz Constants
- How Many Instructions Can LLMs Follow at Once?
- Language Models Improve When Pretraining Data Matches Target Tasks
- Your LLM Knows the Future: Uncovering its Multi-Token Prediction Potential
- REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once
- The Devil Behind the Mask: An Emergent Safety Vulnerability of Diffusion LLMs
- Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation
- Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
- Diffusion Beats Autoregressive in Data-Constrained Settings
- Rubrics as Rewards
- Deep Researcher with Test-Time Diffusion
- Learning without training
- Beyond Binary Rewards
- AlphaGo Moment for Model Architecture Discovery
- Checklists Are Better Than Reward Models For Aligning Language Models
- The Invisible Leash of RLVR
- Subliminal Learning