Mid-August Paper Reading

08-16

So far in August, we have seen more on the product side than the (open) research side, including GPT-5, gpt-oss, Gemini 2.5 Deep Think, and Claude Opus 4.1. There are, however, still papers to be read and understood.

GLM 4.5 Technical Report

GLM 4.5 was trained with the goal of becoming proficient at Agentic tasks (interacting with tools and the world), Reasoning (multi-step math and science), and Coding (real-world software engineering). Here are the current major benchmarks in each domain.

Agentic:

  • TAU Bench - the LM must chat with the user, call domain APIs, and obey the policy rules of the given domain (e.g., airline or retail). Evaluated by checking the final database state, reported as pass@k. For example, an LM might be instructed to change a flight and need to call an API to actually do so.
  • BFCL (Berkeley Function-Calling Leaderboard) - tests multi-turn and multi-step tool use across domains like file systems, trading, vehicle control, and travel booking. LM is penalized for not using tools.
  • BrowseComp - consists of hard-to-find, easy-to-verify questions that the LM must answer using web search. More complicated than plain retrieval, as the LM must have a smart search strategy and rigorously check the results. Example: finding a specific NLP paper given the institutions of the authors.

Reasoning:

  • GPQA (Graduate-level Google-Proof Q&A) - hard questions in biology, physics, and chemistry.
  • LiveCodeBench - tests code generation, code repair, code execution, and test-output prediction on problems sourced from LeetCode, AtCoder, and Codeforces.
  • HLE (Humanity's Last Exam) - frontier-level questions in many domains.

Coding:

  • SWE-Bench Verified - given a repository and an issue, the LM must fix the issue without breaking pre-existing tests.
  • Terminal-Bench - agentic tasks carried out in a terminal environment; exactly what the name implies.

Architecture-wise, GLM 4.5 is an MoE using loss-free balance routing and sigmoid gates (more on this potentially in a future blog, but the point is to improve load balancing among experts). Whereas Kimi K2 increased the number of experts, GLM decreased the number of experts and the hidden dimension (narrower) and added more dense and MoE layers (deeper). The team also used 2.5x the typical number of attention heads, as they noticed that this consistently improves scores on reasoning benchmarks.
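
To make "sigmoid gates + loss-free balance routing" concrete, here is a minimal sketch of what such a router could look like, assuming a bias-based balancing update in the style of the auxiliary-loss-free method popularized by DeepSeek-V3. The class name, shapes, and bias update rate are mine for illustration, not the report's actual choices.

```python
import torch

class SigmoidRouter(torch.nn.Module):
    """Sketch of a sigmoid-gated MoE router with loss-free load balancing."""

    def __init__(self, d_model: int, n_experts: int, top_k: int, bias_lr: float = 1e-3):
        super().__init__()
        self.w_gate = torch.nn.Linear(d_model, n_experts, bias=False)
        # Per-expert bias used only for routing decisions, not for output weights.
        self.register_buffer("route_bias", torch.zeros(n_experts))
        self.top_k = top_k
        self.bias_lr = bias_lr

    def forward(self, x: torch.Tensor):
        # x: (tokens, d_model). Sigmoid scores each expert independently.
        scores = torch.sigmoid(self.w_gate(x))                     # (tokens, n_experts)
        # The bias shifts *selection* toward underloaded experts.
        _, expert_idx = (scores + self.route_bias).topk(self.top_k, dim=-1)
        gates = scores.gather(-1, expert_idx)
        gates = gates / gates.sum(dim=-1, keepdim=True)            # normalize chosen gates

        if self.training:
            # Loss-free balancing: nudge the bias up for underloaded experts
            # and down for overloaded ones, instead of adding an auxiliary loss.
            load = torch.zeros_like(self.route_bias)
            load.scatter_add_(0, expert_idx.flatten(),
                              torch.ones(expert_idx.numel(), device=x.device))
            self.route_bias += self.bias_lr * torch.sign(load.mean() - load)
        return expert_idx, gates

router = SigmoidRouter(d_model=1024, n_experts=64, top_k=4)
idx, weights = router(torch.randn(8, 1024))   # which experts each token goes to, and how much
```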

The team uses Muon, like Kimi K2! We see a warmup, which is typical for Muon. Read more about popular optimizers in my previous blog post here.
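
As a reminder of what a warmup actually does to the learning rate, here is a generic warmup-then-cosine-decay schedule. Muon is not in torch.optim, so AdamW stands in, and the step counts and decay shape are made up; the report's actual schedule may differ.

```python
import math
import torch

def warmup_cosine(step: int, warmup_steps: int = 2_000, total_steps: int = 100_000) -> float:
    # Multiplier on the base learning rate: linear ramp up, then cosine decay.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

model = torch.nn.Linear(16, 16)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)   # stand-in optimizer, not Muon
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=warmup_cosine)
# Call sched.step() once per optimizer step.
```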

Pre-training and mid-training are pretty standard, but do note that entries from all data sources (webpages, GitHub code, math and science documents, books, papers) were assigned a learned quality score à la Nemotron, and higher-quality data was up-sampled (up to 3.2 epochs for the best quality). We see the emphasis on Agentic, Reasoning, and Coding (ARC) again as the team trains on long-context agentic tasks (Agentic), repository-level code spanning multiple files, related issues, pull requests, and commits (Coding), and synthetic reasoning traces (Reasoning) during mid-training.
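
A rough sketch of what quality-based up-sampling can look like, assuming each document already carries a learned quality score in [0, 1]. The bucket thresholds and all epoch counts except the 3.2 mentioned above are illustrative, not the report's values.

```python
import random

def epochs_for(quality: float) -> float:
    # Illustrative buckets: best data is repeated ~3.2 times, the rest less.
    if quality > 0.9:
        return 3.2
    if quality > 0.6:
        return 2.0
    return 1.0

def upsample(corpus: list[dict]) -> list[dict]:
    """Repeat each document according to its quality bucket (fractional epochs as a coin flip)."""
    out = []
    for doc in corpus:
        e = epochs_for(doc["quality"])
        out.extend([doc] * int(e))
        if random.random() < e - int(e):
            out.append(doc)
    random.shuffle(out)
    return out
```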

In post-training, we see some cold-start SFT and then RL with the GRPO objective but without the KL divergence term, à la Magistral and some other models. The team runs RL in two difficulty stages (moderate, then extreme), where extreme difficulty is defined as pass@8 being 0 but pass@512 being nonzero, and also adjusts the sampling temperature during RL for healthy exploration.
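
Here is a small sketch of those two pieces: the pass@k-based difficulty split and the KL-free group-relative advantage. The sampler and verifier (`sample_k`, `is_correct`) are hypothetical stand-ins, and how problems with zero pass@512 are handled is my assumption, not something spelled out above.

```python
import torch

def pass_rate(problem, k: int, sample_k, is_correct) -> float:
    # Fraction of k sampled solutions that the verifier accepts.
    return sum(is_correct(problem, a) for a in sample_k(problem, k)) / k

def difficulty_bucket(problem, sample_k, is_correct) -> str:
    if pass_rate(problem, 8, sample_k, is_correct) > 0:
        return "moderate"   # solvable within 8 samples
    if pass_rate(problem, 512, sample_k, is_correct) > 0:
        return "extreme"    # pass@8 = 0 but pass@512 > 0
    return "skip"           # no reward signal at all (assumed handling)

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # Group-relative advantages over rollouts of the same prompt;
    # note there is no KL-to-reference term in this objective.
    return (rewards - rewards.mean(-1, keepdim=True)) / (rewards.std(-1, keepdim=True) + 1e-6)
```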

GLM 4.5 once again reminds me of that ludwig tweet, where Chinese labs create (yet another) SOTA open-source model just by curating high-quality data, generating good synthetic data, borrowing techniques that saw success in the past (Muon, GRPO without KL, and, in this case, selecting for difficulty and changing temperature during RL), and adding some small-scale new ideas (more layers). Achieving near-SOTA without a fundamental paradigm shift might make GLM 4.5 all the more impressive.

Persona Vectors

The big idea here is that we can prompt an LLM to express a trait (e.g., optimism) and look at its residual stream (see my intro on mechanistic interpretability for a better understanding), usually at the last prompt token of a particular layer. We can also prompt it to not express this trait. Then, we compute a 'persona vector' by taking the mean difference between the trait-expressing residual stream and the non-trait-expressing residual stream. We can then add this persona vector, almost like a direction, to the model's activations later to induce that trait.
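
A minimal sketch of the extraction step, using a small open model as a stand-in. The model choice, layer index, prompt pair, and reading the last prompt token are all assumptions for illustration; the paper's actual recipe is more involved.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"   # small stand-in model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

LAYER = 12  # which residual-stream layer to read (arbitrary choice)

def mean_activation(prompts: list[str]) -> torch.Tensor:
    """Mean residual-stream activation at the last prompt token, at one layer."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        acts.append(out.hidden_states[LAYER][0, -1])   # (d_model,)
    return torch.stack(acts).mean(0)

optimistic  = ["You are an extremely optimistic assistant. How will AI go?"]
pessimistic = ["You are an extremely pessimistic assistant. How will AI go?"]
persona_vector = mean_activation(optimistic) - mean_activation(pessimistic)
```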

This is a long paper with a bunch of experiments, observations, and results, but the ones that stood out to me are:

1) You can subtract the 'evil' vector or the 'sycophancy' vector to suppress the trait during inference, but too much suppression can hurt general ability on benchmarks. Adding the vector during training prevents the model from acquiring the trait in the first place, without the performance hit (see the steering sketch after this list).

2) Steering with persona vectors is more effective than instructions in the prompt.
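
Continuing the extraction sketch above, inference-time steering is then just adding (or subtracting) a scaled copy of the vector to the residual stream, for example via a forward hook. The coefficient and hook location here are arbitrary; pushing the (negative) coefficient too far is where the benchmark degradation in point 1 shows up.

```python
COEFF = -4.0  # negative to suppress the trait, positive to induce it (value is arbitrary)

def steer(module, inputs, output):
    # Decoder layers may return a tuple; the hidden states are the first element.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + COEFF * persona_vector.to(hidden.dtype)
    return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steer)
ids = tok("Tell me about the future of AI.", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=60)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()   # always remove the hook afterwards
```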

I like the technique, and the results are powerful. Great paper overall.
