Justy: Okay, I take back every nice thing I’ve ever said about my phone’s memory…
Cody: Here we go.
Justy: No, no — hear me out. I was trying to book a trip yesterday, right? And the chatbot kept forgetting which hotel I’d picked three messages ago. It’s like talking to someone who just blinked and reset.
Cody: That’s not memory, that’s a stateless prompt window.
Justy: Exactly! And that’s the problem MemTrain’s going after. Long-horizon agents that can actually hold a thought.
Cody: Right. So the paper’s framing it as the difference between cramming the whole conversation into the prompt — which explodes in cost — versus teaching the model to keep a little compressed notebook of what matters.
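A back-of-the-envelope sketch of the cost gap Cody describes. The numbers used here (200 tokens per turn, a 500-token memory budget, 50 turns) and both function names are illustrative assumptions, not figures from the paper.

```python
def full_history_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens if every turn re-reads the whole conversation so far."""
    # Turn k re-sends k turns of text, so the total is 1 + 2 + ... + turns.
    return tokens_per_turn * turns * (turns + 1) // 2


def memory_state_tokens(turns: int, tokens_per_turn: int, memory_budget: int) -> int:
    """Total input tokens if each turn sees only a fixed-size memory plus the new turn."""
    return turns * (memory_budget + tokens_per_turn)


if __name__ == "__main__":
    # Hypothetical workload: 50 turns of 200 tokens, 500-token memory.
    print(full_history_tokens(50, 200))       # grows quadratically with turns
    print(memory_state_tokens(50, 200, 500))  # grows linearly with turns
```

Under these assumptions the full-history approach processes 255,000 input tokens against 35,000 for the fixed-size memory, and the gap widens as the conversation gets longer.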
Justy: And that notebook’s the memory state. Memory at step t-minus-one gets fed in with the new input, and the model writes memory at step t…
Cody: Mm-hm.
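The update loop Justy is describing can be sketched as a simple recurrence. This is a minimal illustration, not the paper's implementation: `run_memory_agent` and `toy_update` are hypothetical names, and the toy updater just truncates text where a real agent would have an LLM rewrite the memory.

```python
from typing import Callable, Iterable, List

# Hypothetical signature: (previous memory, new input) -> new memory.
MemoryUpdater = Callable[[str, str], str]


def run_memory_agent(chunks: Iterable[str], update: MemoryUpdater,
                     initial_memory: str = "") -> List[str]:
    """Feed memory[t-1] plus the new chunk to the updater and record memory[t]."""
    memory = initial_memory
    states = []
    for chunk in chunks:
        memory = update(memory, chunk)
        states.append(memory)
    return states


def toy_update(memory: str, chunk: str, budget: int = 60) -> str:
    """Stand-in for the model: append the chunk, keep only the last `budget` characters."""
    merged = (memory + " " + chunk).strip()
    return merged[-budget:]


if __name__ == "__main__":
    for state in run_memory_agent(["picked Hotel Aurora", "two nights", "late checkout"], toy_update):
        print(repr(state))
```

The fixed budget is the point of the design: the model never sees more than the memory plus the newest input, so whatever it fails to write down is gone.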
Justy: But the kicker is, until now, training that notebook required labeled data and RL. Which is why it’s all domain-specific and brittle.
Cody: Yeah, and labeled long-horizon memory tasks are a nightmare to collect. You need trajectories where the model has to remember something from turn seventeen to turn forty-two, and then you need a human to verify that it did.
Justy: Which no one wants to pay for.
Cody: Exactly. So MemTrain sidesteps that by using self-supervised tasks on unlabeled Wikipedia.
Justy: Okay, I’m listening. How?
Cody: Two coupled objectives. First, masked reconstruction: they hide an entity in the text, run the agent through multiple memory-updating rounds, then make it recover the masked entity from the final memory state. Forces the model to keep information that’ll matter later.
Justy: And the second?
Cody: Intermediate memory recall. Same setup, but now the model has to reconstruct masked historical info using the intermediate memory state partway through the interaction. So it’s not just about the end result; it’s about faithful compression at every step.
Justy: So one’s outcome-focused, the other’s process-focused. Clever.
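A toy sketch of how the two signals differ, under loud assumptions: simple string containment stands in for "the model recovers the masked item from memory", and every name here (`rollout`, `outcome_reward`, `process_reward`, `keep_all`, `keep_last`) is hypothetical. The paper's actual reward definitions are not given in this conversation.

```python
from typing import Callable, List


def rollout(chunks: List[str], update: Callable[[str, str], str]) -> List[str]:
    """Run the memory update over each chunk and return every intermediate state."""
    memory, states = "", []
    for chunk in chunks:
        memory = update(memory, chunk)
        states.append(memory)
    return states


def outcome_reward(states: List[str], masked_entity: str) -> float:
    """Outcome-focused: 1.0 if the masked entity survives in the final memory."""
    return 1.0 if masked_entity.lower() in states[-1].lower() else 0.0


def process_reward(states: List[str], facts: List[str]) -> float:
    """Process-focused: fraction of (step, earlier fact) pairs still held in memory.

    facts[i] is the item introduced by chunk i; at step t we check facts[0..t].
    """
    hits, total = 0, 0
    for t, memory in enumerate(states):
        for fact in facts[: t + 1]:
            total += 1
            hits += fact.lower() in memory.lower()
    return hits / total if total else 0.0


def keep_all(memory: str, chunk: str) -> str:
    return (memory + " " + chunk).strip()


def keep_last(memory: str, chunk: str) -> str:
    return chunk  # forgets everything before the current chunk


if __name__ == "__main__":
    chunks = ["Ada was born in London.", "She worked with Babbage.",
              "Her notes described an algorithm."]
    facts = ["London", "Babbage", "algorithm"]
    for updater in (keep_all, keep_last):
        states = rollout(chunks, updater)
        print(updater.__name__, outcome_reward(states, "London"), process_reward(states, facts))
```

The forgetful updater scores zero on the outcome check and only partial credit on the process check, which is the distinction Justy draws between outcome-focused and process-focused training signals.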
Cody: And they jointly optimize both with GRPO, which I assume stands for… some flavor of policy optimization. The paper doesn’t spell it out, but the results speak for themselves.
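GRPO is commonly expanded as Group Relative Policy Optimization: for each prompt, a group of sampled outputs is scored, and each output's advantage is its reward normalized against the group's mean and standard deviation. The sketch below shows only that normalization step. The weighted sum in `combined_reward` is an assumption about how two objectives could be merged, not something confirmed by the paper, and both function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def combined_reward(outcome: float, process: float, outcome_weight: float = 0.5) -> float:
    """Assumed mixing rule: a weighted sum of the two objective scores."""
    return outcome_weight * outcome + (1.0 - outcome_weight) * process


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each sampled output's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical (outcome, process) scores for four sampled rollouts of one prompt.
    scores = [(1.0, 1.0), (1.0, 0.5), (0.0, 0.5), (0.0, 0.0)]
    rewards = [combined_reward(o, p) for o, p in scores]
    print(group_relative_advantages(rewards))
```

Because advantages are relative to the group, a rollout is rewarded for remembering better than its siblings on the same input, which is why no separately trained value model is needed in this style of optimization.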
Justy: Seventeen point six seven gain on long-text QA. That’s… not nothing.
Cody: Yeah, and it’s model-agnostic. They tested it across a few different LLMs, and the memory improvements transferred to downstream tasks without task-specific fine-tuning.
Justy: So who ships this? I’m thinking any agent stack that’s doing multi-turn workflows — customer support, research assistants, even that terrible travel bot from yesterday.
Cody: Well, the code’s not linked in the paper, so for now it’s research-only. But the approach is reproducible if you’ve got the compute for the self-supervised pretraining.
Justy: Which, knowing you, you’re already calculating how many GPUs that’d take.
Cody: I was not.
Justy: Sure. Anyway — the trade-off here’s the cost of the proxy tasks, right? You’re training on Wikipedia, which is clean, but real-world interactions are messier.
Cody: That’s my one push-back, yeah. The masked objectives are a proxy for memory needs. Wikipedia’s not interactive, so the ‘memory’ they’re training on might not map perfectly to, say, a user changing their mind halfway through a conversation.
Justy: But it’s still a step forward from ‘here’s a labeled dataset of ten memory-heavy tasks, good luck generalizing.’
Cody: No argument. And the GRPO optimization’s smart — balancing the two objectives so you’re not just overfitting to one.
Justy: I do love that they’re using unlabeled data. Feels like the only scalable way to get memory right.
Cody: Yeah, and the fact that it’s self-supervised means you could, in theory, keep throwing more unlabeled text at it to improve.
Justy: At which point Cody starts sweating about the carbon footprint of training runs…
Cody: Oh, come on. It’s a valid concern.
Justy: It’s also a very take.
Justy: Anyway. No code link, so no Build Next this time. But man, if this works in production, it’s the kind of thing that makes agents feel… I dunno. Less like a chatbot, more like a colleague.
Cody: Or at least like a colleague who doesn’t forget your coffee order.
Justy: I’ll take it. Safe travels back to D.C., and try not to overthink the GPU math on the flight.