Justy: Okay, I take back every nice thing I’ve ever said about my phone’s memory…
Cody: Here we go.
Justy: No, no, hear me out. I was trying to book a trip yesterday, right? And the chatbot kept forgetting which hotel I’d picked three messages ago. It’s like talking to someone who just blinked and reset.
Cody: That’s not memory, that’s a stateless prompt window.
Justy: Exactly! And that’s the problem MemTrain’s going after. Long-horizon agents that can actually hold a thought.
Cody: Right. So the paper’s framing it as the difference between cramming the whole conversation into the prompt, which explodes in cost, versus teaching the model to keep a little compressed notebook of what matters.
Justy: And that notebook’s the memory state. Memory t-minus-one gets fed in with the new input, model writes memory t…
Cody: Mm-hm.
Justy: But the kicker is, until now, training that notebook required labeled data and RL. Which is why it’s all domain-specific and brittle.
Cody: Yeah, and labeled long-horizon memory tasks are a nightmare to collect. You need trajectories where the model has to remember something from turn seventeen to turn forty-two, and then you need a human to verify that it did.
Justy: Which no one wants to pay for.
Cody: Exactly. So MemTrain sidesteps that by using self-supervised tasks on unlabeled Wikipedia.
Justy: Okay, I’m listening. How?
Cody: Two coupled objectives. First, masked reconstruction: they hide an entity in the text, run the agent through multiple memory-updating rounds, then make it recover the masked entity from the final memory state. Forces the model to keep information that’ll matter later.
Justy: And the second?
Cody: Intermediate memory recall. Same setup, but now the model has to reconstruct masked historical info using the memory state from partway through the interaction. So it’s not just about the end result, it’s about faithful compression at every step.
Justy: So one’s outcome-focused, the other’s process-focused. Clever.
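For readers following along, here is a minimal sketch of the loop Justy and Cody describe: the previous memory plus the new input produces the next memory, and the two proxy objectives are scored against the final and the intermediate memory states. Everything named here is a hypothetical stand-in, not MemTrain's code. In particular, `toy_update` replaces the LLM with a rule that keeps recent capitalized words, and the exact-match scoring is an assumption made for illustration.

```python
from typing import Callable, Dict, List


def run_memory_rounds(chunks: List[str],
                      update: Callable[[str, str], str],
                      memory: str = "") -> List[str]:
    """Apply m_t = update(m_{t-1}, x_t) and return every memory state."""
    states = []
    for chunk in chunks:
        memory = update(memory, chunk)
        states.append(memory)
    return states


def toy_update(memory: str, chunk: str, budget: int = 8) -> str:
    """Hypothetical stand-in for the model: keep the most recent
    capitalized words, up to `budget` words in total."""
    new_words = [w.strip(".,") for w in chunk.split() if w[:1].isupper()]
    kept = memory.split() + new_words
    return " ".join(kept[-budget:])


def final_reconstruction_reward(states: List[str], masked_entity: str) -> float:
    """Outcome-focused proxy: is the masked entity still in the final memory?"""
    return 1.0 if masked_entity in states[-1].split() else 0.0


def intermediate_recall_reward(states: List[str],
                               masked_by_round: Dict[int, str]) -> float:
    """Process-focused proxy: fraction of masked items recoverable from the
    memory state at the round where each one is queried."""
    hits = [1.0 if entity in states[t].split() else 0.0
            for t, entity in masked_by_round.items()]
    return sum(hits) / len(hits) if hits else 0.0


chunks = [
    "Ada Lovelace wrote notes.",
    "She worked with Charles Babbage.",
    "The engine was never built.",
]
states = run_memory_rounds(chunks, toy_update)
print(states[-1])
print(final_reconstruction_reward(states, "Babbage"))
print(intermediate_recall_reward(states, {0: "Lovelace", 1: "Ada"}))
```

Shrinking `budget` makes the toy memory forget early entities, which is the failure the two objectives are meant to penalize.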
Cody: And they jointly optimize both with GRPO, which I assume stands for… some flavor of policy optimization. The paper doesn’t spell it out, but the results speak for themselves.
Justy: A gain of seventeen point six seven on long-text QA. That’s… not nothing.
Cody: Yeah, and it’s model-agnostic. They tested it across a few different LLMs, and the memory improvements transferred to downstream tasks without task-specific fine-tuning.
Justy: So who ships this? I’m thinking any agent stack that’s doing multi-turn workflows: customer support, research assistants, even that terrible travel bot from yesterday.
Cody: Well, the code’s not linked in the paper, so for now it’s research-only. But the approach is reproducible if you’ve got the compute for the self-supervised pretraining.
Justy: Which, knowing you, you’re already calculating how many GPUs that’d take.
Cody: I was not.
Justy: Sure. Anyway, the trade-off here’s the cost of the proxy tasks, right? You’re training on Wikipedia, which is clean, but real-world interactions are messier.
Cody: That’s my one push-back, yeah. The masked objectives are a proxy for real memory needs. Wikipedia’s not interactive, so the ‘memory’ they’re training on might not map perfectly to, say, a user changing their mind halfway through a conversation.
Justy: But it’s still a step forward from ‘here’s a labeled dataset of ten memory-heavy tasks, good luck generalizing.’
Cody: No argument. And the GRPO optimization’s smart, balancing the two objectives so you’re not just overfitting to one.
Justy: I do love that they’re using unlabeled data. Feels like the only scalable way to get memory right.
Cody: Yeah, and the fact that it’s self-supervised means you could, in theory, keep throwing more unlabeled text at it to improve.
Justy: At which point Cody starts sweating about the carbon footprint of training runs…
Cody: Oh, come on. It’s a valid concern.
Justy: It’s also a very take. Anyway. No code link, so no Build Next this time.
But man, if this works in production, it’s the kind of thing that makes agents feel… I dunno. Less like a chatbot, more like a colleague.
Cody: Or at least like a colleague who doesn’t forget your coffee order.
Justy: I’ll take it. Safe travels back to D.C., and try not to overthink the GPU math on the flight.
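A note on the acronym left open in the episode: GRPO is commonly expanded as Group Relative Policy Optimization, where several outputs are sampled for the same input and each reward is normalized against its own group. The sketch below shows only that normalization step applied to a blend of the two memory rewards. The equal weights, the function names, and the use of the population standard deviation are assumptions for illustration. The episode does not say how MemTrain balances the two objectives, and implementations differ on the exact normalization.

```python
from statistics import mean, pstdev
from typing import List


def combined_reward(final_r: float, intermediate_r: float,
                    w_final: float = 0.5, w_mid: float = 0.5) -> float:
    """Blend the outcome and process rewards (weights are hypothetical)."""
    return w_final * final_r + w_mid * intermediate_r


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-6) -> List[float]:
    """Normalize each sampled rollout's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for the same masked passage:
# (final reconstruction reward, intermediate recall reward)
rollouts = [(1.0, 1.0), (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
rewards = [combined_reward(f, m) for f, m in rollouts]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Rollouts that beat their group's average get a positive advantage and the rest get a negative one, so no separate value model is needed to judge what counts as a good memory update.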