Justy: Okay, so they’re saying the whole point of experience internalization—turning past interactions into actual model weights—just falls apart the second you try to do it more than once.
Cody: Yeah, and the graph in Figure one is brutal.
Justy: Right…
Cody: It’s not a gentle degradation. It’s a cliff. Second iteration, third iteration, the curve just nose-dives.
Justy: And the culprit’s on-policy context distillation?
Cody: Exactly. Because you’re correcting mistakes the student made, which means you’re teaching it to recover from its own bad states instead of learning the good path.
Justy: So every iteration bakes in more of the wrong distribution.
Cody: Bingo. It’s like trying to teach someone to drive by only ever showing them how to swerve back from the shoulder.
Justy: God, that is SUCH an Exploring Next take.
Cody: I mean it’s accurate though.
Justy: Fine, fine. Anyway — this thing landed while I was in San Francisco last week, and honestly it was nice to have something to read on the flight that wasn’t another product requirements doc.
Cody: How was the trip?
Justy: Exhausting. Four meetings a day, all of them could’ve been Slack messages. But I did get to see the new office space, which is actually kind of cool.
Cody: Nice. Anyway — back to the paper. The fix is almost stupidly simple once they lay it out.
Justy: Principle-level experience over instance-level, step-wise injection, off-policy distillation.
Cody: Right. So instead of memorizing exact conversation snippets, you distill the underlying rules or strategies.
Justy: Mm-hm.
Cody: Then you don’t dump all that distilled knowledge at the start of the prompt. You inject it at the exact decision point where it’s needed, like a just-in-time cheat sheet.
Justy: And the off-policy part means you’re learning from clean teacher trajectories instead of the student’s messy ones.
Cody: Yeah. Stable signal, no drift. The whole pipeline suddenly compounds instead of collapsing.
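A rough sketch of the recipe as described in this conversation, not the paper's actual code. It assumes a trajectory is a list of teacher steps, that each step may carry one extracted principle, and that "internalizing" means training the student to reproduce the teacher's action without the principle in its prompt. Every name here (`Step`, `teacher_prompt`, `student_prompt`, `build_pairs`, the `[principle]` tag) is hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Step:
    observation: str                 # what the agent sees at this decision point
    action: str                      # what the teacher did here (off-policy target)
    principle: Optional[str] = None  # hypothetical: principle relevant to this step


def teacher_prompt(history: List[str], step: Step) -> str:
    """Step-wise injection: the principle appears right at the decision point,
    not as a block at the start of the prompt."""
    parts = list(history)
    if step.principle:
        parts.append(f"[principle] {step.principle}")
    parts.append(f"[obs] {step.observation}")
    return "\n".join(parts)


def student_prompt(history: List[str], step: Step) -> str:
    """Same state, but with no principle in context."""
    return "\n".join(list(history) + [f"[obs] {step.observation}"])


def build_pairs(trajectory: List[Step]) -> List[Tuple[str, str]]:
    """Build (student prompt, teacher action) training pairs.

    The history is rolled forward with the teacher's own actions, so every
    state comes from the teacher trajectory (off-policy), never from
    whatever the student would have done.
    """
    pairs: List[Tuple[str, str]] = []
    history: List[str] = []
    for step in trajectory:
        pairs.append((student_prompt(history, step), step.action))
        history.append(f"[obs] {step.observation}")
        history.append(f"[act] {step.action}")
    return pairs


trajectory = [
    Step("door is locked", "look for key", principle="check for tools before forcing"),
    Step("key on table", "take key"),
]
pairs = build_pairs(trajectory)
```

An on-policy variant would instead roll the history forward with the student's own actions, which is the setup the hosts blame for the collapse.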
Justy: Okay, so who’s actually going to build with this?
Cody: Honestly? Probably no one ships it tomorrow. The experiments are still in controlled agent environments, not production chatbots.
Justy: But the recipe’s concrete. Principle granularity, step-wise injection, off-policy distillation. That’s not hand-wavy.
Cody: No, and the failure mode they’re solving is very real. Anyone doing multiple rounds of fine-tuning without this is basically playing Russian roulette with model decay.
Justy: So you’re saying this is a must-read for teams iterating on agentic workflows?
Cody: At minimum. And the repo’s public — RUCBM slash ExpInternalization on GitHub.
Justy: Nice. I’ll get the link from you later.
Cody: You’ll copy it from the show notes like everyone else.
Justy: Fair.
Cody: Anyway, trade-offs. Off-policy distillation means you need high-quality teacher data up front. If your teacher trajectories are garbage, you’re still sunk.
Justy: Right. And principle-level abstraction sounds great until you realize someone has to define what a principle even is.
Cody: Mm. They sidestep that by using automatic extraction from the teacher’s own outputs, but it’s still a hidden cost.
Justy: Still, the fact that they turned a collapse into compounding improvement with three tweaks is… wild.
Cody: Spoken like a true product optimist.
Justy: I’ll take it. Alright, I’m grabbing that repo link. Try not to find three new problems with it before then.