Justy: Okay, I just read this thing and my first thought is—holy crap, they’re finally talking about the spiral in agents.
Cody: Mm-hm.
Justy: Like, every time you spin up a sub-agent or call a tool, you’re just dumping more tokens into the mix, and then the next turn has to re-read all of it. It gets ridiculous fast.
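To put rough numbers on the compounding Justy describes, here is a minimal sketch. It assumes each turn appends a fixed number of new tokens and that every model call re-reads the full history. The function name and the figures are illustrative assumptions, not numbers from the Nemotron material.

```python
def total_tokens_read(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens processed when every turn re-reads the whole history."""
    total = 0
    context = 0
    for _ in range(turns):
        context += tokens_per_turn  # new tool output or sub-agent reply joins the context
        total += context            # the next call has to read everything so far
    return total


print(total_tokens_read(10, 2000))   # 110000
print(total_tokens_read(100, 2000))  # 10100000
```

Ten times as many turns costs roughly ninety times as many tokens read, because the total grows with the square of the turn count.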
Cody: Right. And that’s where this Nemotron 3 Ultra thing slots in. They’re pitching it as the heavy lifter for the hard steps.
Justy: Frontier reasoning plus five times the throughput? That’s the dream for anyone running long workflows.
Cody: Yeah, if the throughput claim holds. Five times is… a lot.
Justy: Well, they’re leaning on NVFP4 quantization and LatentMoE. Cody, you’ve seen MoE routing before—does LatentMoE actually cut the compute the way they say?
Cody: It’s MoE but with a compressed latent space for the router. So instead of firing a bunch of experts every step, it picks a few and only materializes those. And NVFP4 is four-bit floating-point weights with a shared scale for each small block of values, so yeah, you can fit way more on the same GPU without giving up much accuracy.
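The block-scaled four-bit idea behind formats like NVFP4 can be sketched in a few lines. This is a simplified illustration, not NVIDIA's implementation: it assumes 16-value blocks and the E2M1 value grid, and it keeps each block scale as an ordinary Python float, whereas the real format stores scales in FP8 and adds a per-tensor scale on top. All function names here are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign is handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # number of values that share one scale factor (assumed block size)


def quantize_block(block):
    """Return (scale, codes) for one block; codes are signed grid values."""
    max_abs = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the top of the grid.
    scale = max_abs / E2M1_GRID[-1] if max_abs > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(nearest if x >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate values from a scale and its 4-bit codes."""
    return [c * scale for c in codes]


def roundtrip(weights):
    """Quantize and dequantize a flat list of weights, block by block."""
    out = []
    for i in range(0, len(weights), BLOCK):
        scale, codes = quantize_block(weights[i:i + BLOCK])
        out.extend(dequantize_block(scale, codes))
    return out


weights = [0.01 * i for i in range(-16, 16)]
restored = roundtrip(weights)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max round-trip error: {max_err:.4f}")
```

Because each block carries its own scale, one outlier only coarsens the 16 values next to it rather than a whole channel, which is the main reason block scaling holds up better at four bits.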
Justy: And the hybrid Mamba-Transformer layers—that’s for the long context, right?
Cody: Mamba handles the sequential part cheaply, Transformer layers kick in when you need the expressivity. It’s a trick to keep long-context inference from blowing up.
Justy: So—this is such an Exploring Next take—if you’re running a multi-agent pipeline and you keep hitting context limits or token costs, swapping in Nemotron 3 Ultra for the orchestration layer could actually be a no-brainer.
Cody: God, you already have the product slide in your head.
Justy: I’m just saying—thirty percent lower token cost on SWE-bench? That’s real money for teams shipping agents at scale.
Cody: Okay, but look at the EnterpriseOps-Gym number: thirty-three percent on long-horizon planning. That’s behind GLM and Qwen. So it’s not all wins.
Justy: Fair. But they’re still leading on PinchBench and long-context RULER at a million tokens. And they hit ninety-five percent on RULER; no one else in the table even reaches a million.
Cody: Yeah… and the open weights and recipes are a big deal. If you’re in a domain where none of the teachers fit, you can fine-tune it with your own data.
Justy: Which, by the way—Multi-Teacher On-Policy Distillation with more than ten domain-specific teachers. That’s how they’re getting the specialization without starting from scratch every time.
Cody: Mm-hm. And the RL pipeline’s fully transparent, so you can audit what the model’s actually learning from.
Justy: Anyway, I flew in late last night and my brain’s still on west coast time, so bear with me—
Cody: You sound like you drank three espressos on the plane.
Justy: I did. But the thing that’s sticking with me is the angle. You don’t need Nemotron for every single call—just the ones where the agent has to think hard.
Cody: Right. So you’d pair it with a smaller, faster model for the routine stuff and only route the complex turns to Nemotron.
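The pairing Cody describes comes down to a small routing rule in front of the two models. A minimal sketch follows, assuming a planning flag and a context-size threshold are enough to spot the hard turns. The model labels, the threshold, and the `Turn` fields are all hypothetical placeholders, not anything from the Nemotron release.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    prompt: str
    context_tokens: int   # size of the history this call must read
    needs_planning: bool  # set by the orchestrator for multi-step reasoning


# Hypothetical labels and threshold; tune these against your own traces.
SMALL_MODEL = "small-fast-model"
LARGE_MODEL = "large-reasoning-model"
CONTEXT_THRESHOLD = 32_000


def pick_model(turn: Turn) -> str:
    """Send only the hard turns to the large model; everything else stays cheap."""
    if turn.needs_planning or turn.context_tokens > CONTEXT_THRESHOLD:
        return LARGE_MODEL
    return SMALL_MODEL


routine = Turn("Rename this variable.", context_tokens=1_200, needs_planning=False)
hard = Turn("Plan the migration across services.", context_tokens=4_000, needs_planning=True)
print(pick_model(routine), pick_model(hard))
```

In practice the hard part is deciding what counts as a complex turn. A static rule like this is only a starting point, and logging which turns were escalated makes it easier to check whether the larger model is earning its cost.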
Justy: Exactly. And if the throughput and token efficiency pan out, the math might actually work.
Cody: I mean… I’m still side-eyeing that five-x claim until I see third-party repros. But the architecture checks out.
Justy: And the benchmarks—even with the mixed bag—are strong enough that I’d at least kick the tires.
Cody: Yeah. If your agents are blowing through context or tokens, it’s worth a look. If not… eh.
Justy: There it is—the Cody caveat. Always a caveat.
Cody: Someone’s gotta be the skeptic.
Justy: Alright, forty-six-whatever this is. Next time you’re in LA, we’re testing Nemotron on my to-do list. See if it can finally organize my inbox.
Cody: Good luck with that.