September Paper Reading

09-07

TL;DR - a method for improving prompts, a proof that agents contain world models, performance and efficiency gains for MoE, and a synthetic data pipeline

GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

Genetic-Pareto (GEPA) is a method for optimizing prompts. GEPA keeps a pool of candidate prompts and, at each step, either mutates a candidate by reflecting on the prompt and its execution feedback, or performs crossover by merging two prompts from the pool.

The Pareto part is how GEPA selects which prompts to keep for the next generation. GEPA grades candidates by evaluating them on a set of training instances. If no other candidate dominates a given candidate (i.e., scores at least as well on every instance and strictly better on at least one), it sits on the Pareto frontier and stays in the pool. Using a frontier rather than a single absolute best maintains diversity and avoids premature collapse onto one winner.
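
To make the selection rule concrete, here's a toy sketch of the frontier check. It's my own illustration, not GEPA's code; the real procedure has more bookkeeping around how surviving candidates get sampled for the next round of mutations.

```python
# Toy illustration (not GEPA's actual code) of Pareto-frontier selection:
# a candidate survives if no other candidate scores at least as well on every
# training instance and strictly better on at least one.
def dominates(a, b):
    """True if score vector `a` dominates score vector `b`."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(scores):
    """`scores` maps candidate name -> list of per-instance scores."""
    return [
        name for name, vec in scores.items()
        if not any(dominates(other, vec)
                   for other_name, other in scores.items() if other_name != name)
    ]

# "a" wins instance 0, "b" wins instance 1, so both stay; "c" is dominated by both.
print(pareto_frontier({"a": [0.9, 0.2], "b": [0.3, 0.8], "c": [0.2, 0.1]}))  # ['a', 'b']
```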

This approach shines when dealing with compound AI systems, which the paper defines as 'any modular system composed of one or more language model (LLM) invocations.' To see what the team means by that, let's look at some of the benchmarks they use:

  • HotpotQA is multi-hop Wikipedia QA, where the questions require synthesizing evidence from multiple Wikipedia articles. The typical LLM workflow thus looks like (1) querying a Wikipedia article, (2) processing and reflecting, (3) querying another Wikipedia article, and repeat (see the sketch after this list).
  • HoVer is multi-hop Wikipedia claim verification. Similar to HotpotQA, but the model is gathering articles to verify/refute a claim instead.
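
To make "compound system" concrete, here's a hypothetical sketch of such a pipeline. The llm and search_wikipedia callables are stand-ins I made up, not a real API.

```python
# Hypothetical two-hop QA pipeline; `llm` and `search_wikipedia` are stand-in
# callables, not a real retrieval or model API.
def multi_hop_qa(question, llm, search_wikipedia, hops=2):
    notes = []
    query = llm(f"Write a Wikipedia search query for: {question}")
    for _ in range(hops):
        article = search_wikipedia(query)                       # retrieve one article
        notes.append(llm(f"From this article, extract facts relevant to "
                         f"'{question}':\n{article}"))
        # Reflect on what evidence is still missing, then search again.
        query = llm(f"Given these notes {notes}, what should we search next "
                    f"to answer '{question}'?")
    return llm(f"Answer '{question}' using only these notes: {notes}")
```

Each llm(...) call is a separate module with its own prompt, which is exactly the kind of thing an optimizer like GEPA can evolve independently.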

GEPA outperforms the leading prompt optimizer, MIPROv2, by over 10% on both Qwen3 8B and GPT-4.1 mini. At this point, we might be so familiar with GEPA that it's hard to imagine any other prompt optimization approach. So, a natural question to ask: what does MIPROv2 do?

MIPROv2 evaluates the model on the benchmark, keeps already-correct responses, and reuses them as few-shot examples going forward. That might seem purely exploitative! To encourage exploration, MIPROv2 also uses an 'instruction proposer' that broadens the search space, grounding new candidate instructions in dataset summaries, program-code summaries, and tips that push for more diverse instructions.
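
Here's a rough sketch of the bootstrapping half of that, with program and metric as placeholders; it leaves out the instruction proposer and the search MIPROv2 runs over combinations of instructions and demonstrations.

```python
# Simplified sketch of few-shot bootstrapping: run the program on training
# examples and keep the ones it already answers correctly as demonstrations.
# `program` and `metric` are placeholders, not a real library API.
def bootstrap_demos(program, metric, train_set, max_demos=4):
    demos = []
    for example in train_set:
        prediction = program(example["question"])
        if metric(prediction, example["answer"]):        # already correct -> keep it
            demos.append({"question": example["question"], "answer": prediction})
        if len(demos) == max_demos:
            break
    return demos  # later spliced back into the prompt as few-shot examples
```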

If the task at hand yields meaningful textual feedback that the model can reflect upon, GEPA is likely the way to go.

General agents contain world models

Viewing an LLM agent as a goal-conditioned deterministic policy in a fully observable controlled Markov process (the RL lens), this paper proves that the policy alone is enough to determine a predictive world model, defined as a one-step transition-probability estimator. A bound is then established on the error between this estimate and the true transition probabilities.

I won't go into the proof here, but I encourage everyone to derive it for themselves! There's a really great idea here about turning a trajectory into a sequence of Bernoulli trials and then extracting a bound from there.
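
For flavor, here is the generic concentration step for Bernoulli trials (Hoeffding's inequality). This is not the paper's actual bound or construction, just the standard tool an argument of this shape can lean on once the trajectory has been recast as n Bernoulli trials.

```latex
% Generic Hoeffding bound for n Bernoulli trials, not the paper's exact result:
% \hat{P} is the empirical frequency of the transition s -> s' under action a.
\[
  \Pr\!\Big( \big| \hat{P}(s' \mid s, a) - P(s' \mid s, a) \big| \ge \epsilon \Big)
  \;\le\; 2\exp\!\left(-2 n \epsilon^2\right).
\]
```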

LongCat Technical Report

Thank you, Meituan, for takeout and language models!

The two big ideas that the team incorporates into LongCat-Flash both improve MoE.

The first idea is zero-computation experts, which comes from the 2024 MoE++ paper. These are extra experts that serve as identity mappings; they just pass the inputs through. Including these lets the model vary how many real experts it wants to activate, allowing for dynamic compute.
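
Here's a toy PyTorch sketch of the idea. It's my own simplification, not LongCat-Flash's implementation (the real model routes among many more experts and handles load balancing), but it shows how identity experts let per-token compute vary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEWithZeroComputeExperts(nn.Module):
    """Toy MoE layer where some 'experts' are identity mappings (zero compute)."""

    def __init__(self, d_model=64, n_real=4, n_zero=2, top_k=2):
        super().__init__()
        self.top_k, self.n_real = top_k, n_real
        # The router scores real and identity experts together.
        self.router = nn.Linear(d_model, n_real + n_zero)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_real)
        ])

    def forward(self, x):  # x: (num_tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)
        weights, idx = gates.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(gates.shape[-1]):
                mask = idx[:, slot] == e
                if not mask.any():
                    continue
                w = weights[mask, slot].unsqueeze(-1)
                if e < self.n_real:
                    out[mask] += w * self.experts[e](x[mask])   # real FFN expert
                else:
                    out[mask] += w * x[mask]                    # identity: no FLOPs
        return out

# Tokens routed to identity experts skip the FFN entirely, so the number of
# real experts activated per token varies with the router's decision.
print(MoEWithZeroComputeExperts()(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```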

The second idea is Shortcut-connected MoE (ScMoE), from a 2025 paper of a similar name. ScMoE is a scheduling change that overlaps the current MoE layer's communication and the preceding block's FFN computation. LongCat uses this to boost efficiency.
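
Very loosely, the schedule looks like the sketch below: launch the MoE "dispatch" on a side CUDA stream while the dense FFN runs on the default stream, then synchronize before the expert FFNs consume the dispatched tokens. This is my own illustration of stream overlap, using an async copy as a stand-in for the all-to-all, not LongCat's actual kernel schedule; it assumes a CUDA GPU and pinned host memory.

```python
import torch

def overlapped_step(x, dense_ffn, run_experts, tokens_to_dispatch):
    """Toy overlap of 'communication' (a stand-in async copy) with dense compute."""
    comm_stream = torch.cuda.Stream()
    with torch.cuda.stream(comm_stream):
        # Stand-in for the all-to-all that routes tokens to their experts;
        # assumes `tokens_to_dispatch` lives in pinned CPU memory.
        dispatched = tokens_to_dispatch.to("cuda", non_blocking=True)
    dense_out = dense_ffn(x)                               # runs while the copy is in flight
    torch.cuda.current_stream().wait_stream(comm_stream)   # sync before expert compute
    return dense_out + run_experts(dispatched)
```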

There are also smaller yet important tweaks. See the new scaling factors for MLA, variance compensation for expert initialization, and a hidden z-loss to prevent massive activations.

Hermes 4 Technical Report

Hermes 4 is a family of post-trained models. The team uses Llama 3.1 as the base for the 70B and 405B variants and Qwen3-14B as the base for the 14B model. The models are great at instruction following and creative expression (EQ-Bench and Creative Writing), with far fewer refusals, while remaining strong at math and decent at coding.

What I found most interesting about Hermes 4 is the data synthesis and curation pipeline, and specifically what data they choose to incorporate. I really like this idea of simulating humans with synthetic personas so the model learns to be familiar with a wide range of user personalities.

Overall, the Hermes post-training pipeline uses LLMs to write prompts, answer them, and grade the answers, producing high-quality synthetic data. On top of being a solid model, Hermes 4 is a great proof of concept for harnessing LLMs end-to-end.
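
For a feel of what such a loop might look like, here's a hypothetical sketch. The llm callable, the persona framing, and the grading format are all made up for illustration; the actual Hermes pipeline does far more careful generation and filtering.

```python
# Hypothetical prompt -> answer -> grade loop; `llm` is a stand-in callable and
# the persona/grading formats are invented for illustration.
def synthesize_examples(llm, personas, topics, min_score=8):
    dataset = []
    for persona in personas:
        for topic in topics:
            prompt = llm(f"You are {persona}. Write a challenging question about {topic}.")
            answer = llm(prompt)
            grade = llm("Rate this answer from 1-10 for correctness and helpfulness. "
                        f"Reply with the number only.\nQ: {prompt}\nA: {answer}")
            if int(grade.strip()) >= min_score:       # keep only high-scoring pairs
                dataset.append({"prompt": prompt, "response": answer})
    return dataset
```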