(Please visit the permanent blog address: https://rdi.berkeley.edu/blog/rl-grokking-recipe)

Yiyou Sun¹, Yuhan Cao, Pohao Huang¹, Haoyue Bai², Hannaneh Hajishirzi³⁴, Nouha Dziri⁴♠, Dawn Song¹♠

¹ University of California, Berkeley · ² University of Wisconsin–Madison · ³ University of Washington · ⁴ AI2 (♠ indicates equal advising)

<aside> 💡

Question: Can reinforcement learning (RL) actually teach large language models new algorithms—or does it only “sharpen” what’s already latent in the base model?

Recent analyses argue that RL keeps models on a leash: pass@1 improves, but what's solvable at large sampling budgets (e.g., pass@128) doesn't expand (see the sketch after this aside for what pass@k measures). We set out to test this directly, and our answer is: RL can discover something new, but only when trained wisely.

</aside>
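For readers unfamiliar with the metric: pass@k is the probability that at least one of k sampled solutions is correct. Below is a minimal sketch of the standard unbiased estimator from Chen et al. (2021, "Evaluating Large Language Models Trained on Code"); it is included only to make the pass@1 vs. pass@128 contrast concrete and is not code from this paper.

```python
# Unbiased pass@k estimator (Chen et al., 2021), shown here only to
# make "pass@1 vs. pass@128" concrete.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n total samples, of which c are correct.

    Returns the probability that at least one of k samples drawn
    without replacement from the n is correct.
    """
    if n - c < k:
        return 1.0  # fewer than k failures exist, so success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 5 correct solutions out of 200 samples.
print(pass_at_k(200, 5, 1))    # ~0.025 (pass@1)
print(pass_at_k(200, 5, 128))  # ~0.99  (pass@128)
```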

Paper link: https://www.arxiv.org/abs/2509.21016

(Est. 3–5 minute read)


TL;DR


Manufactoria: a pure OOD learnability testbed

Manufactoria is a Flash puzzle game from 2010 in which you build machines that test robots by reading and writing the colored tapes they carry. We took that core idea and turned it into a clean, programmable playground for studying learnability. Some problems are so challenging that even advanced LLMs like GPT-5 achieve a 0% success rate!


Instead of a 2D puzzle grid, we expose a minimal program syntax with just two primitive “machines”: a puller (reads/moves) and a painter (writes/marks). Think of it as a tweaked Turing machine in which the reader may only operate on the left end of the tape and the writer only on the right.
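To make the reader/writer asymmetry concrete, here is a minimal interpreter sketch. The instruction names (pull/paint), the color symbols, and the program encoding are our own illustrative assumptions, not the paper's actual DSL; the point is only that reads consume from the left end of the tape while writes append to the right.

```python
# Minimal sketch of the two-primitive tape machine described above.
# The PULL/PAINT encoding and color names are illustrative assumptions;
# the actual Manufactoria-style DSL may differ.
from collections import deque

def run(program, tape, max_steps=1000):
    """Run a program on a tape of color symbols.

    The tape behaves like a queue: a "pull" instruction consumes the
    leftmost symbol and branches on its color; a "paint" instruction
    appends a symbol on the right. Programs map a state name to one
    instruction tuple.
    """
    tape = deque(tape)
    state = "start"
    for _ in range(max_steps):
        op = program[state]
        if op[0] == "pull":                 # reader: left end only
            _, branches, on_empty = op
            if not tape:
                state = on_empty
            else:
                state = branches.get(tape.popleft(), "reject")
        elif op[0] == "paint":              # writer: right end only
            _, color, nxt = op
            tape.append(color)
            state = nxt
        if state in ("accept", "reject"):
            return state == "accept", list(tape)
    return False, list(tape)                # non-halting within budget

# Example: accept tapes that start with 'R' (red), marking them with 'G'.
prog = {
    "start": ("pull", {"R": "mark"}, "reject"),
    "mark":  ("paint", "G", "accept"),
}
print(run(prog, ["R", "B"]))   # (True, ['B', 'G'])
print(run(prog, ["B", "R"]))   # (False, ['R'])
```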

Why this is truly out-of-distribution (OOD)