enbao

generalist; freedom and light.

early june paper reading

this will be less formal because these are my notes

XX^T can be faster

tldr: they apply a simplified alphatensor-style approach to a specific matrix multiplication (X times its own transpose) that carries additional structure. an rl-guided large neighborhood search found a constant-factor improvement (constant-factor because the asymptotic exponent doesn't change; the algorithm just needs fewer multiplications).
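to make "additional structure" concrete, here's a toy numpy sketch of mine (not the paper's algorithm): the output of X @ X.T is symmetric, so a blocked implementation only needs the upper-triangular output blocks and can mirror the rest. the paper's bilinear algorithm goes further than this, but it's the flavor of structure being exploited.

```python
import numpy as np

def xxt_blocked(X, b=2):
    """compute X @ X.T while only forming the upper-triangular output blocks."""
    n = X.shape[0]
    G = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(i, n, b):
            G[i:i+b, j:j+b] = X[i:i+b, :] @ X[j:j+b, :].T
            if j > i:
                G[j:j+b, i:i+b] = G[i:i+b, j:j+b].T  # symmetry gives the lower block for free
    return G

X = np.random.randn(6, 4)
assert np.allclose(xxt_blocked(X), X @ X.T)
```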

the underlying theory: the matrix product can be reconstructed from bilinear forms (products of linear combinations of the input entries), but the search space for the minimum number of such products is enormous. as for methodology, it's worth reading alphatensor.
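for intuition on "reconstructed using bilinear forms", the textbook example is strassen's 2x2 scheme: 7 products of linear combinations of the inputs instead of the naive 8. this isn't from the paper, just the classic case of what an alphatensor-style search is hunting for.

```python
import numpy as np

def strassen_2x2(A, B):
    """reconstruct a 2x2 product from 7 bilinear products (strassen's scheme)."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return np.array([[m1 + m4 - m5 + m7, m3 + m5],
                     [m2 + m4,           m1 - m2 + m3 + m6]])

A, B = np.random.randn(2, 2), np.random.randn(2, 2)
assert np.allclose(strassen_2x2(A, B), A @ B)
```

applied recursively to block matrices, saving one product per level is what makes strassen sub-cubic; the XX^T paper finds an analogous saving specialized to the symmetric case.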

eval: tested on a 6144 by 6144 matrix on hardware, beating prev SOTA both theoretically and experimentally.

the novel contrib is mostly realizing that for specific matmuls, extra structure can enable further optimization. however, there are few matmuls that (1) have useful extra structure and (2) are regularly used in training pipelines. in this case, XX^T is used in PCA, linear regression, kernel methods, and optimizers like shampoo, soap, and muon, so this is a crucial optimization, even if the gain is a single-digit percentage.
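a rough sketch (mine, made-up dimensions, not from the paper) of where XX^T shows up in an optimizer: shampoo-style preconditioners accumulate G @ G.T per layer per step, so even a few percent saved on that product compounds over a long run.

```python
import numpy as np

# illustrative only: shampoo-style preconditioner statistics for one layer
d_out, d_in = 512, 256
L_stat = np.zeros((d_out, d_out))
R_stat = np.zeros((d_in, d_in))
for step in range(100):
    G = np.random.randn(d_out, d_in)   # stand-in for the layer's gradient
    L_stat += G @ G.T                  # left statistic: an XX^T product every step
    R_stat += G.T @ G                  # right statistic
```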

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models.

tldr: challenges the myth that RL cannot uncover novel reasoning strategies and can only surface methods the base model already has.

the team shows this by evaluating on (ood) tasks where the base model fails entirely, regardless of the number of attempts or the sampling method.
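side note to self: the natural way to operationalize "fails regardless of number of attempts" is pass@k staying at zero even for large k. below is the standard unbiased pass@k estimator (from the codex/humaneval paper); i'm not claiming this is the paper's exact protocol.

```python
from math import comb

def pass_at_k(n, c, k):
    """unbiased estimate of P(at least one of k samples is correct),
    given n samples of which c were correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(128, 0, 128))  # 0.0 -> the base model never solves the task
print(pass_at_k(128, 3, 16))   # ~0.33 once a few of the 128 samples succeed
```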

lots of math (this is nemotron). after skimming: they start from grpo, but grpo causes entropy collapse (the policy prematurely commits to a narrow set of outputs). instead of naively raising the sampling temperature during rollouts, the team uses DAPO (look into later), which loosens the clipping bounds in the PPO-style objective (look into later). they also add a KL penalty against a reference policy to keep prolonged training stable.
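my mental model of the objective, as a hedged sketch (the epsilon values and the kl estimator are placeholders, not the paper's exact loss): dapo-style asymmetric clipping plus a kl term against a frozen reference policy.

```python
import torch

def clipped_objective(logp_new, logp_old, logp_ref, advantage,
                      eps_low=0.2, eps_high=0.28, kl_coef=0.001):
    """per-token surrogate to maximize (negate for a loss). logp_* are log-probs
    under the current, rollout, and reference policies; advantage is a
    grpo-style group-normalized advantage."""
    ratio = torch.exp(logp_new - logp_old)
    # asymmetric clipping: a looser upper bound leaves room to up-weight
    # low-probability tokens, which is dapo's counter to entropy collapse
    surrogate = torch.minimum(
        ratio * advantage,
        torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * advantage,
    )
    kl = logp_new - logp_ref   # crude estimator of kl to the reference policy
    return (surrogate - kl_coef * kl).mean()
```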

Surprisingly Fast AI-Generated Kernels We Didn't Mean to Publish

scaling intelligence lab ft. prime intellect. i kneel

tldr: search and branching in natural language can result in better kernels than iterative kernel edits (the latter falls into local minima, and there's little incentive for fundamentally new designs). genetic in concept.
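a toy sketch of how i understand the loop (strings and random numbers stand in for llm calls and gpu benchmarks; nothing here is the blog's actual code): branch on natural-language optimization ideas first, implement each idea as a fresh kernel, keep the fastest survivors.

```python
import random

def propose_ideas(parent, n):          # stand-in for an llm brainstorming step
    return [f"{parent} + idea{i}" for i in range(n)]

def implement(idea):                   # stand-in for an llm-written kernel
    return f"kernel<{idea}>"

def benchmark(kernel):                 # stand-in for measured runtime (ms)
    return random.random()

def search(seed="baseline", rounds=4, branches=3, population=5):
    pool = [implement(seed)]
    for _ in range(rounds):
        children = [implement(idea) for parent in pool
                    for idea in propose_ideas(parent, branches)]
        pool = sorted(pool + children, key=benchmark)[:population]  # keep the fastest
    return pool[0]

print(search())
```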

I don't know much about GPUs, so my takeaways are gonna be much more fundamental than the specific things that this paper has to offer.

things that are important to optimize in kernels: efficient data movement (memory access optimization), overlapping slow operations with computation (latency hiding), lower-precision data types, reducing instruction count, leveraging specialized hardware, maximizing active warps (parallelism), and reducing control flow overhead.
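to ground a couple of those terms, here's the canonical triton vector-add (tutorial-style, not from the blog post): contiguous offsets give coalesced loads/stores (data movement), masking handles the ragged tail without branching, and the 1d grid of program instances is what gets spread across warps (parallelism).

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)  # contiguous -> coalesced accesses
    mask = offsets < n_elements                            # guard the last partial block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```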

concluding thoughts:

on rl, it seems like we're approaching a plateau - many different post-training methods and variants all show only marginal improvements over the base model.

on gpu and low-level optimizations, the two papers both point towards search and branching (whether rl-guided or genetic) as a promising way to generate new, more efficient computations and kernels, rather than manual exploration. this feels similar to what's currently going on in mech interp: the original process was manual and highly tedious, and circuit discovery is now largely automated (anthropic papers are coming up soon).