Justy: Okay, so they’re saying the whole point of experience internalization, turning past interactions into actual model weights, just falls apart the second you try to do it more than once.
Cody: Yeah, and the graph in Figure 1 is brutal.
Justy: Right…
Cody: It’s not a gentle degradation. It’s a cliff. Second iteration, third iteration, the curve just nose-dives.
Justy: And the culprit’s on-policy context distillation?
Cody: Exactly. Because you’re correcting mistakes the student made, which means you’re teaching it to recover from its own bad states instead of learning the good path.
Justy: So every iteration bakes in more of the wrong distribution.
Cody: Bingo. It’s like trying to teach someone to drive by only ever showing them how to swerve back from the shoulder.
Justy: God, that is SUCH an Exploring Next take.
Cody: I mean, it’s accurate, though.
Justy: Fine, fine. Anyway, this thing landed while I was in San Francisco last week, and honestly it was nice to have something to read on the flight that wasn’t another product requirements doc.
Cody: How was the trip?
Justy: Exhausting. Four meetings a day, all of them could’ve been Slack messages. But I did get to see the new office space, which is actually kind of cool.
Cody: Nice. Anyway, back to the paper. The fix is almost stupidly simple once they lay it out.
Justy: Principle-level experience over instance-level, step-wise injection, off-policy distillation.
Cody: Right. So instead of memorizing exact conversation snippets, you distill the underlying rules or strategies.
Justy: Mm-hm.
Cody: Then you don’t dump all that distilled knowledge at the start of the prompt. You inject it at the exact decision point where it’s needed, like a just-in-time cheat sheet.
Justy: And the off-policy part means you’re learning from clean teacher trajectories instead of the student’s messy ones.
Cody: Yeah. Stable signal, no drift. The whole pipeline suddenly compounds instead of collapsing.
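For readers following along, here is a minimal sketch of the step-wise injection idea Cody describes, contrasted with dumping every principle at the start of the prompt. It is an illustration under stated assumptions, not the paper's implementation: the `Principle` structure, the `decision_point` labels, the function names, and the example rules are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Principle:
    # Hypothetical structure: the decision point a rule applies to, plus the rule itself.
    decision_point: str
    rule: str


def global_prompt(task: str, principles: list[Principle]) -> str:
    """Baseline: put every principle at the start of the prompt."""
    hints = "\n".join(f"- {p.rule}" for p in principles)
    return f"Principles:\n{hints}\n\nTask: {task}"


def stepwise_prompt(observation: str, decision_point: str,
                    principles: list[Principle]) -> str:
    """Step-wise injection: attach only the rules that match the current decision point."""
    relevant = [p.rule for p in principles if p.decision_point == decision_point]
    if not relevant:
        # Nothing applies at this step, so the observation is passed through unchanged.
        return observation
    hints = "\n".join(f"- {r}" for r in relevant)
    return f"{observation}\n\nHint for this step:\n{hints}"


# Illustrative principles only; these are not taken from the paper.
PRINCIPLES = [
    Principle("search", "Check the current room before moving to another one."),
    Principle("pickup", "Confirm the object name matches the goal before taking it."),
]

step = stepwise_prompt("You see a desk and a drawer.", "search", PRINCIPLES)
print(step)
```

In an off-policy setup like the one discussed, a teacher would presumably act on the hinted prompt, and the student would be trained on those teacher trajectories rather than on its own rollouts. That training loop is outside the scope of this sketch.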
Justy: Okay, so who’s actually going to build with this?
Cody: Honestly? Probably no one ships it tomorrow. The experiments are still in controlled agent environments, not production chatbots.
Justy: But the recipe’s concrete. Principle granularity, step-wise injection, off-policy distillation. That’s not hand-wavy.
Cody: No, and the failure mode they’re solving is super real. Anyone doing multi-round fine-tuning without this is basically playing Russian roulette with model decay.
Justy: So you’re saying this is a must-read for teams iterating on agentic workflows?
Cody: At minimum. And the repo’s public: RUCBM slash ExpInternalization on GitHub.
Justy: Nice. I’ll get the link from you later.
Cody: You’ll copy it from the show notes like everyone else.
Justy: Fair.
Cody: Anyway, trade-offs. Off-policy distillation means you need high-quality teacher data up front. If your teacher trajectories are garbage, you’re still sunk.
Justy: Right. And principle-level abstraction sounds great until you realize someone has to define what a principle even is.
Cody: Mm. They sidestep that by using automatic extraction from the teacher’s own outputs, but it’s still a hidden cost.
Justy: Still, the fact that they turned a collapse into compounding improvement with three tweaks is… wild.
Cody: Spoken like a true product optimist.
Justy: I’ll take it. Alright, I’m grabbing that repo link. Try not to find three new problems with it before then.