(Please visit the permanent blog address: https://rdi.berkeley.edu/blog/rl-grokking-recipe)

Yiyou Sun¹, Yuhan Cao, Pohao Huang¹, Haoyue Bai², Hannaneh Hajishirzi³⁴, Nouha Dziri⁴♠, Dawn Song¹♠

¹ University of California, Berkeley · ² University of Wisconsin–Madison · ³ University of Washington · ⁴ AI2 (♠ indicates equal advising)

<aside> 💡

Question: Can reinforcement learning (RL) actually teach large language models new algorithms—or does it only “sharpen” what’s already latent in the base model?

Recent analyses argue that RL keeps models on a leash: pass@1 improves, but what's solvable at large sampling budgets (e.g., pass@128) doesn't expand (see the sketch after this aside for what pass@k measures). We set out to test this directly, and our answer is: RL can discover something new, but only when trained wisely.

</aside>
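For readers unfamiliar with the metric: pass@k is the probability that at least one of k sampled solutions is correct. Below is a minimal sketch of the standard unbiased estimator from Chen et al. (2021, "Evaluating Large Language Models Trained on Code"); it is included only to make the pass@1 vs. pass@128 contrast concrete and is not code from this paper.

```python
# Unbiased pass@k estimator (Chen et al., 2021), shown here only to
# make "pass@1 vs. pass@128" concrete.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n total samples, of which c are correct.

    Returns the probability that at least one of k samples drawn
    without replacement from the n is correct.
    """
    if n - c < k:
        return 1.0  # fewer than k failures exist, so success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 5 correct solutions out of 200 samples.
print(pass_at_k(200, 5, 1))    # ~0.025 (pass@1)
print(pass_at_k(200, 5, 128))  # ~0.99  (pass@128)
```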

Paper link: https://www.arxiv.org/abs/2509.21016

(Est. 3–5 minute read)


TL;DR


Manufactoria: a pure OOD learnability testbed

Manufactoria is a Flash puzzle game from 2010 in which you build machines that test robots by reading and writing the colored tapes they carry. We took that core idea and turned it into a clean, programmable playground for studying learnability. Some problems are so challenging that even advanced LLMs like GPT-5 achieve a 0% success rate!


Instead of a 2D puzzle grid, we expose a minimal program syntax with just two primitive “machines”: a puller (reads/moves) and a painter (writes/marks). Think of it as a tweaked Turing machine in which the reader may only operate on the left end of the tape and the writer only on the right.
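To make the reader/writer asymmetry concrete, here is a minimal interpreter sketch. The instruction names (pull/paint), the color symbols, and the program encoding are our own illustrative assumptions, not the paper's actual DSL; the point is only that reads consume from the left end of the tape while writes append to the right.

```python
# Minimal sketch of the two-primitive tape machine described above.
# The PULL/PAINT encoding and color names are illustrative assumptions;
# the actual Manufactoria-style DSL may differ.
from collections import deque

def run(program, tape, max_steps=1000):
    """Run a program on a tape of color symbols.

    The tape behaves like a queue: a "pull" instruction consumes the
    leftmost symbol and branches on its color; a "paint" instruction
    appends a symbol on the right. Programs map a state name to one
    instruction tuple.
    """
    tape = deque(tape)
    state = "start"
    for _ in range(max_steps):
        op = program[state]
        if op[0] == "pull":                 # reader: left end only
            _, branches, on_empty = op
            if not tape:
                state = on_empty
            else:
                state = branches.get(tape.popleft(), "reject")
        elif op[0] == "paint":              # writer: right end only
            _, color, nxt = op
            tape.append(color)
            state = nxt
        if state in ("accept", "reject"):
            return state == "accept", list(tape)
    return False, list(tape)                # non-halting within budget

# Example: accept tapes that start with 'R' (red), marking them with 'G'.
prog = {
    "start": ("pull", {"R": "mark"}, "reject"),
    "mark":  ("paint", "G", "accept"),
}
print(run(prog, ["R", "B"]))   # (True, ['B', 'G'])
print(run(prog, ["B", "R"]))   # (False, ['R'])
```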

Why this is truly out-of-distribution (OOD)