Justy: Okay so — what if you trained your memory ONCE, and then just... swapped in a smarter brain whenever a better model dropped? That's basically what this MIT paper is claiming.
Cody: Which sounds too clean. But the numbers are actually interesting, so — yeah, let's get into it.
Justy: Also I'm running on like five hours of sleep, I drove up this morning and traffic was unhinged. I'm not even fully here yet.
Cody: I could tell. You texted me from the freeway, which, please don't do that. Anyway — MeMo.
Justy: Okay, MeMo. So the thing it's solving — LLMs are frozen after training, right? Their internal knowledge just... stops. And if you want to update it, you're either doing full retraining, which is wildly expensive, or you're doing RAG, which has its own mess of problems.
Cody: Right.
Cody: And the RAG problems are real. One of the co-authors said something I thought was pretty sharp — that vector databases have a fundamentally difficult job encoding the full semantics of a chunk of text in a single vector. Because relevance sometimes only becomes apparent when you look at multiple chunks together. So you end up retrieving the wrong stuff, or just... missing the point of the query entirely.
Justy: Mm-hm.
Cody: And RAG is sensitive to noise, which is a real problem in enterprise deployments. The typical knowledge base is a disaster: duplicate files, outdated policies, stuff that should've been deleted two years ago.
Justy: Every company's internal wiki is just a graveyard of good intentions. So MeMo's answer is — don't retrieve from raw documents at all. You encode the knowledge into a separate small model instead.
Cody: Yeah, and the architecture is actually clever. There are two main pieces. The Memory model is a smaller language model (they used Qwen2.5 14B in the experiments), and it gets fine-tuned specifically to hold the new knowledge in its weights. Then there's the Executive model, which is your big frozen LLM, the reasoning engine. It never gets touched.
Justy: Oh interesting.
Cody: At inference time, the Executive decomposes a user's question into atomic sub-questions, fires them at the Memory model like API calls basically, gets the facts back, and synthesizes the answer. It's treating the Memory model as an oracle.
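Show notes: a minimal sketch of the loop Cody just described, in which the Executive decomposes the question, queries the Memory model once per sub-question, and synthesizes the answer. Everything named here (`decompose`, `make_toy_memory`, `synthesize`, `executive_answer`) is hypothetical and is not the paper's code. In a real system each step would be an LLM call, so a string split and a dictionary lookup stand in for them.

```python
from typing import Callable, Dict, List

# A Memory model is treated as an oracle: sub-question in, fact out.
MemoryFn = Callable[[str], str]


def decompose(question: str) -> List[str]:
    """Toy stand-in for the Executive's decomposition step.

    Assumption: a real Executive would prompt an LLM here; we just
    split a compound question on the word "and".
    """
    parts = [p.strip() for p in question.rstrip("?").split(" and ")]
    return [p + "?" for p in parts if p]


def make_toy_memory(facts: Dict[str, str]) -> MemoryFn:
    """Toy stand-in for the fine-tuned Memory model (a lookup table)."""
    def query(sub_question: str) -> str:
        return facts.get(sub_question, "unknown")
    return query


def synthesize(question: str, facts: List[str]) -> str:
    """Toy stand-in for the Executive's final synthesis step."""
    return "; ".join(facts)


def executive_answer(question: str, memory: MemoryFn) -> str:
    sub_questions = decompose(question)          # 1. break the question down
    facts = [memory(q) for q in sub_questions]   # 2. one "API call" per sub-question
    return synthesize(question, facts)           # 3. combine the returned facts


memory = make_toy_memory({
    "Who wrote the policy?": "Dana wrote the policy",
    "when was it updated?": "it was updated in 2023",
})
print(executive_answer("Who wrote the policy and when was it updated?", memory))
```

Because `executive_answer` only depends on the `MemoryFn` interface, swapping in a different Executive would not require touching the Memory side, which is the decoupling the hosts come back to later.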
Justy: And the thing that makes that work is what they call reflections — which I love as a name, honestly. Instead of just dumping raw documents into training, a Generator model first converts everything into targeted question-answer pairs. Every angle of the corpus, captured as QA pairs. THEN the Memory model trains on that.
Cody: Which is doing real work. You're not just compressing text — you're forcing the model to internalize the knowledge in a queryable form. That's why it holds up under noise. The Executive never sees the messy raw docs.
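Show notes: a rough sketch of the reflections idea, where a Generator turns each document into question-answer pairs and those pairs become the Memory model's training set. The names `toy_generator` and `build_reflections` are hypothetical, and so is the prompt/completion record format. The real Generator is an LLM; a simple "X is Y" sentence rule stands in for it. The duplicate filter is our own addition, not something the episode attributes to the paper.

```python
from typing import Callable, Dict, List, Tuple

QAPair = Tuple[str, str]


def toy_generator(document: str) -> List[QAPair]:
    """Hypothetical Generator: one QA pair per 'subject is value' sentence."""
    pairs: List[QAPair] = []
    for sentence in document.split("."):
        sentence = sentence.strip()
        if " is " in sentence:
            subject, value = sentence.split(" is ", 1)
            subject = subject[0].lower() + subject[1:]
            pairs.append((f"What is {subject}?", value))
    return pairs


def build_reflections(
    corpus: List[str],
    generator: Callable[[str], List[QAPair]] = toy_generator,
) -> List[Dict[str, str]]:
    """Convert a corpus into fine-tuning examples for the Memory model."""
    seen = set()
    examples: List[Dict[str, str]] = []
    for document in corpus:
        for question, answer in generator(document):
            if (question, answer) in seen:
                continue  # assumption: skip exact duplicates across documents
            seen.add((question, answer))
            examples.append({"prompt": question, "completion": answer})
    return examples


corpus = [
    "The refund window is 30 days. The support tier is gold.",
    "The refund window is 30 days.",  # a duplicate, as in a messy wiki
]
reflections = build_reflections(corpus)
for example in reflections:
    print(example)
```

The point of the sketch is the shape of the data: the Memory model trains on questions and answers rather than on raw text, so what it learns is already in the form the Executive will ask for.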
Justy: Right, right.
Justy: And the benchmark results are kind of wild, Cody. On NarrativeQA — which is a long-document multi-hop reasoning benchmark — MeMo hit 53.58% accuracy paired with Gemini 3 Flash. HippoRAG2, which is a state-of-the-art graph-based RAG system, maxed out at 23.21%.
Cody: That gap is LARGE. Like, that's not a marginal improvement — that's a different regime of performance on the same task.
Justy: And then the swap thing — they just switched the Executive model from Qwen to Gemini 3 Flash and got a 26.73% jump on NarrativeQA. No retraining. The Memory model didn't change at all.
Cody: Yeah, that's the part that's actually interesting to me from a systems standpoint. The memory artifact is decoupled from the reasoning engine. So as frontier models improve, you just... plug in the new one. Your private data never has to leave your infrastructure to get fine-tuned into a closed model.
Justy: Which is the enterprise pitch right there. Regulated industries, legal, healthcare — anywhere you can't just send your docs off to a third-party API for fine-tuning.
Cody: Sure. Though I want to be honest about the trade-offs, because they're real. Continual updates — like when new documents come in — use model merging instead of full retraining. They derive a task vector from the new data and mathematically merge it into the existing Memory model weights. It's much cheaper. But the paper says you take an eleven to nineteen percent accuracy hit compared to a full retrain.
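Show notes: a small sketch of the merging step, using the common task-arithmetic recipe in which a task vector is the difference between fine-tuned and starting weights, and that difference is added into the existing model. The episode only says a task vector is derived and merged, so the exact formula, the `scale` parameter, and the dictionary-of-lists weight format are all assumptions for illustration.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def task_vector(base: Weights, finetuned: Weights) -> Weights:
    """Per-parameter difference: what fine-tuning on the new data changed."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def merge(current: Weights, vector: Weights, scale: float = 1.0) -> Weights:
    """Add a (scaled) task vector into the existing Memory model weights.

    `scale` is a hypothetical knob; the source does not mention one.
    """
    return {
        name: [c + scale * v for c, v in zip(current[name], vector[name])]
        for name in current
    }


base = {"layer.w": [1.0, 2.0]}      # checkpoint the new fine-tune started from
tuned = {"layer.w": [1.5, 1.0]}     # same checkpoint after training on new docs
current = {"layer.w": [2.0, 2.0]}   # Memory model already in production

vector = task_vector(base, tuned)
merged = merge(current, vector, scale=0.5)
print(vector)
print(merged)
```

The merge is a handful of additions per parameter, which is why it is so much cheaper than retraining. It also never revisits the old training data, which is one plausible reason for the accuracy gap Cody mentions.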
Justy: Hm.
Justy: Is that a dealbreaker though? For most enterprise use cases — like, you're not doing perfect retrieval anyway. An eleven percent hit off a baseline that already doubled HippoRAG2 is still probably fine.
Cody: Fair. My other flag is the upfront cost. Generating the reflections — converting your whole corpus into QA pairs with a thirty-two billion parameter Generator model — that's not free. For a huge knowledge base that's a meaningful compute bill before you've even trained the Memory model.
Justy: One-time cost though, presumably. And then updates are cheap.
Cody: Presumably. I'd want to see how the reflection generation scales on a corpus that's actually enterprise-sized — like, hundreds of thousands of documents, not research benchmark scale. But yeah, the architecture is genuinely novel. I'm not just being a pessimist here.
Justy: You said that so defensively.
Cody: I'm growing.
Justy: Okay — this was episode four forty three of Exploring Next, which, how are we at four forty three. That's unhinged. Get some rest, Cody, and I'll talk to you soon.
Cody: Drive safe this time.