Justy: What if the biggest bottleneck in multi-agent AI isn't the models, but the fact that they're all just... texting each other?
Justy: Welcome to Exploring Next, episode 345. We're talking about a paper called RecursiveMAS — and honestly, Cody, when you sent this over I had to read the abstract twice.
Cody: Yeah, it's one of those papers where the headline number, an 8.3% average accuracy gain across nine benchmarks, almost undersells what's actually going on under the hood. The token reduction is the part that stopped me: up to 75.6% fewer tokens used. That's not a rounding error.
Justy: And this is coming out of UIUC, Stanford, NVIDIA, MIT — so it's not a one-lab thing. What's the core problem they're solving?
Cody: So the standard way multi-agent systems work today: one agent finishes, writes out its answer in text, the next agent reads that text and continues. Every handoff is a full decode-then-re-encode cycle. That's slow, it's expensive, and when you try to train the whole system end-to-end, gradients basically vanish because text is a discrete bottleneck — you can't backpropagate through words.
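To make the "discrete bottleneck" point concrete, here is a minimal toy sketch, not anything from the paper. It assumes a hypothetical upstream agent that emits three logits and a hypothetical downstream agent that is just a weighted sum. Nudging a logit changes nothing when the handoff goes through a decoded token (argmax), but it does change the result when the continuous vector is passed directly.

```python
def argmax(xs):
    # Index of the largest value, i.e. the "decoded token".
    return max(range(len(xs)), key=lambda i: xs[i])

def downstream(vec, weights):
    # Hypothetical stand-in for the next agent: a plain weighted sum.
    return sum(w * x for w, x in zip(weights, vec))

def text_handoff(logits, weights):
    # Decode to a single discrete token, then re-encode it as a one-hot vector.
    token = argmax(logits)
    one_hot = [1.0 if i == token else 0.0 for i in range(len(logits))]
    return downstream(one_hot, weights)

def latent_handoff(logits, weights):
    # Pass the continuous vector straight through, with no decoding step.
    return downstream(logits, weights)

def finite_diff(fn, logits, weights, index=0, eps=1e-3):
    # Numerical sensitivity of the downstream result to one upstream logit.
    bumped = list(logits)
    bumped[index] += eps
    return (fn(bumped, weights) - fn(logits, weights)) / eps

logits = [1.0, 2.5, 0.3]    # illustrative values only
weights = [0.5, -1.0, 2.0]  # illustrative values only

text_sensitivity = finite_diff(text_handoff, logits, weights)
latent_sensitivity = finite_diff(latent_handoff, logits, weights)
print("text handoff sensitivity:", text_sensitivity)
print("latent handoff sensitivity:", latent_sensitivity)
```

The text path reports zero sensitivity because the small nudge does not change which token wins, so there is no signal for training to follow. The latent path recovers the downstream weight on that logit.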
Justy: Right, so you're burning tokens just to carry information from one model to the next. It's like... printing an email, handing someone the paper, they retype it.
Cody: Exactly. And their fix is — skip the text entirely for intermediate steps. Keep everything in continuous latent space. The key piece is a module called RecursiveLink. It's actually pretty small — a two-layer residual projection. Not a whole model, just a lightweight learned bridge.
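For a feel of how small a "two-layer residual projection" can be, here is a rough sketch in plain Python. Everything specific in it is an assumption rather than the paper's design: the class name `ResidualLink`, the GELU activation, the hidden width, the initialization scale, and the choice to keep input and output dimensions equal.

```python
import math
import random

def gelu(x):
    # Tanh approximation of GELU. The activation choice is an assumption.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def matvec(matrix, vec):
    # Multiply a row-major matrix by a vector.
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

class ResidualLink:
    """Hypothetical two-layer residual projection: out = h + W2 @ act(W1 @ h)."""

    def __init__(self, dim, hidden, seed=0, scale=0.02):
        rng = random.Random(seed)
        # W1 maps dim -> hidden, W2 maps hidden -> dim.
        self.W1 = [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(hidden)]
        self.W2 = [[rng.gauss(0.0, scale) for _ in range(hidden)] for _ in range(dim)]

    def __call__(self, h):
        z = [gelu(a) for a in matvec(self.W1, h)]
        delta = matvec(self.W2, z)
        # Residual connection: the input passes through, plus a learned correction.
        return [x + d for x, d in zip(h, delta)]

link = ResidualLink(dim=8, hidden=16)
h = [0.1 * i for i in range(8)]  # stand-in for a hidden state
out = link(h)
print(len(out), max(abs(a - b) for a, b in zip(out, h)))
```

With small initial weights the link starts out close to an identity map, which is one common reason to use a residual form for a bridge module. Whether the paper initializes it this way is not stated in the episode.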
Justy: So it's not replacing the agents, it's just changing how they pass information between each other.
Cody: Right. And there are two versions of it. An inner RecursiveLink sits inside each agent and helps it consolidate its own latent thoughts during generation — the paper calls these the model's 'ongoing latent thoughts,' which is a nice way to put it. Then an outer RecursiveLink bridges hidden representations across agents that might be completely different model families, different sizes. So you could have a Qwen3 agent handing a latent state to a LLaMA-3 agent and the outer lin
Justy: That's the part I keep coming back to — heterogeneous agents. Because in practice, nobody's running a system where every model is identical. You've got specialized tools, different fine-tunes. The fact that this works across model families is actually a big deal for real deployments.
Cody: And critically — only the last agent in the last recursion round ever produces text output. Everything before that stays in latent space. That's where the token savings come from. You're not generating and re-ingesting intermediate text at every step.
Justy: Okay so training this thing — how does that work? Because training a system that's looping through itself sounds like it could get messy fast.
Cody: They handle it with what they call an inner-outer loop training algorithm. The inner loop first warms up each agent individually — trains its inner RecursiveLink to get comfortable with latent thought generation. Then the outer loop kicks in and trains the cross-agent connections, with gradients flowing back through the full recursion trace. So every agent sees feedback not just from its own outputs but from what happened downstream. The whole system co-optimizes together.
Justy: They also prove it stays stable — like the gradient vanishing problem that kills text-based training doesn't happen here?
Cody: They provide theoretical analysis on both runtime complexity and learning dynamics, yeah. The latent connections maintain stable gradient flow across recursion rounds. That's one of the things I find credible about the paper — they're not just showing empirical results, they're explaining why it should work mathematically.
Justy: Alright, so — who actually builds with this? I'm thinking about the team that's already running something like a multi-agent pipeline in production. Is this shippable or is it a research artifact right now?
Cody: Honest answer: it's closer to research artifact today. The project page is up at recursivemas.github.io but I haven't seen a fully open training repo yet. That said — the architecture is not exotic. The RecursiveLink is small enough that a team with ML infra could implement it. And the fact that they tested it on four different collaboration patterns — sequential reasoning, mixture-of-experts, expert-to-learner distillation, tool-integrated deliberation — tells me they were t
Justy: That top-end 75.6% token reduction is the number that would get a finance team or an infra team to actually pay attention. That's not a research metric, that's a cost line.
Cody: True. Though I'd flag — that range is 34.6% to 75.6%. The low end is real but it's less dramatic. Depending on your pipeline, you might land closer to the middle, and then you have to weigh that against the added complexity of running this system.
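To put the quoted range in budget terms, here is a small worked calculation. The baseline of one million tokens and the helper name `tokens_after_reduction` are made up for illustration. Only the 34.6% and 75.6% figures come from the episode.

```python
def tokens_after_reduction(baseline_tokens, reduction_pct):
    # Tokens still used after cutting `reduction_pct` percent of the baseline.
    return round(baseline_tokens * (100 - reduction_pct) / 100)

baseline_tokens = 1_000_000  # hypothetical monthly usage, not from the paper

low_end = tokens_after_reduction(baseline_tokens, 34.6)
high_end = tokens_after_reduction(baseline_tokens, 75.6)
print("remaining at 34.6% reduction:", low_end)
print("remaining at 75.6% reduction:", high_end)
```

On that hypothetical baseline, the low end leaves 654,000 tokens and the high end leaves 244,000, so where a pipeline lands in the range changes the savings by more than a factor of two.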
Justy: What's your honest concern with it?
Cody: Interpretability. When agents are passing text, you can read the intermediate outputs. You can debug. You can see where reasoning went wrong. When everything's in latent space, that trace is gone. You get the final answer and if it's wrong, you're doing a lot more forensic work to figure out why. For regulated domains — medicine is literally one of their benchmarks — that could be a real blocker.
Justy: Yeah that's a fair point. I think for internal tooling or code generation pipelines you'd probably care less. But the moment you're in a domain where someone needs to audit the reasoning chain, losing that intermediate text is a problem. [sighs] It's always something.
Cody: Always. [chuckles] Okay, Build Next.
Justy: Yeah — three things. First, go to recursivemas.github.io and see what's actually released. Project page is live and worth bookmarking even if the full repo isn't open yet. Second, if you're a solo builder and want to feel what latent-passing actually is — grab a small Qwen3 or Gemma3 model, build a two-agent toy loop where you manually pass the last hidden state instead of decoded text, and just observe where things break. You'll learn more in a weekend than reading three mor
Cody: And third — if you're on a team already running LangGraph or a similar orchestration framework, take one of your existing pipelines and actually benchmark your current token costs per task. Because if you don't have that baseline number today, you can't evaluate whether something like this is worth adopting when it does ship publicly.
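A baseline like the one Cody describes can start as a few lines of bookkeeping. The sketch below assumes you can already export prompt and completion token counts per task from your own logs. The `TaskRun` record, the sample numbers, and the price per thousand tokens are all placeholders, and nothing here calls LangGraph or any other framework's API.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class TaskRun:
    task_id: str
    prompt_tokens: int
    completion_tokens: int

def baseline_report(runs, usd_per_1k_tokens):
    # Summarize total token use per task and a rough cost per task.
    totals = [r.prompt_tokens + r.completion_tokens for r in runs]
    avg = mean(totals)
    return {
        "tasks": len(runs),
        "mean_tokens_per_task": avg,
        "max_tokens_per_task": max(totals),
        "mean_usd_per_task": avg / 1000 * usd_per_1k_tokens,
    }

# Placeholder data: replace with counts exported from your own pipeline logs.
runs = [
    TaskRun("task-a", prompt_tokens=1200, completion_tokens=300),
    TaskRun("task-b", prompt_tokens=900, completion_tokens=250),
    TaskRun("task-c", prompt_tokens=1500, completion_tokens=350),
]

report = baseline_report(runs, usd_per_1k_tokens=0.002)  # hypothetical price
print(report)
```

Once the same report exists for a latent-passing variant, the comparison becomes a difference between two measured numbers rather than a headline percentage.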
Justy: Good one. Know your baseline before you chase the headline. Alright — we started by asking whether agents texting each other is the bottleneck. I think the answer is yes, and RecursiveMAS is a real shot at fixing it. Episode 345, done.