Justy: This is Exploring Next, episode 334. Xiaomi just dropped two open-source models, MiMo-V2.5 and V2.5-Pro, and they're claiming they're cheaper and more efficient at agentic tasks than GPT or Claude. Cody, you've looked at the benchmarks—what's actually going on here?
Cody: Yeah, so the headline is token efficiency. Their Pro model does claw tasks—that's autonomous agent stuff, right, like having an AI handle your email or publish content—using about 40 to 60 percent fewer tokens than Claude Opus or GPT-5.4. And the pricing backs that up. But here's what I'm skeptical about: benchmarks for agentic tasks are notoriously narrow. They test the model on a curated set of workflows in a lab. Real production agents hit edge cases, they fail partway thr
Justy: But Cody, if you're running thousands of these agents a day, even a narrow win on efficiency compounds. You're talking about running a customer service team that handles email triage, or a content team that preps social posts, all automated. At scale, 40 percent fewer tokens is real money. And it's open source, which means you can download the weights and run it yourself instead of betting on Xiaomi's API.
Cody: [sighs] Okay, fair. The self-hosting angle is legit. You're not locked into calling a Chinese API, you're not exposed to whatever regulatory nonsense might happen. But self-hosting has its own cost—you're running inference on your own infra, you need ops people, you need to monitor for drift. The token savings disappear if you're paying someone a hundred grand a year to babysit the thing.
Justy: True, but for a larger enterprise, that's already baked in. They've got infra, they've got ML ops. What they don't have is a way to run ten thousand agent instances without spending a fortune on tokens. That's where MiMo changes the math.
Cody: Let me push on the architecture side, though. MiMo-V2.5-Pro is a trillion-parameter Mixture-of-Experts model with 42 billion active parameters. That's a clever design—you only wake up the experts you need for each query, keeps inference fast and cheap. But parameter count is not the same as coherence. They did these lab demos: a compiler implementation in Rust that took 4.3 hours, a video editor that took 11.5 hours with nearly two thousand tool calls. That's legitimately har
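A toy sketch of the "only wake up the experts you need" idea Cody describes, using generic top-k gating. The episode does not say how MiMo's router actually works, so the function names, the four toy experts, and the router scores here are all hypothetical; only the 42 billion active out of roughly one trillion total comes from the conversation.

```python
import math

def softmax(scores):
    """Turn raw router scores into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    # Renormalise so the selected experts' weights sum to 1.
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, router_scores, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_scores, k))

# Hypothetical toy experts: each one just scales its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
scores = [0.1, 2.0, 0.3, 1.5]  # assumed router scores for one token

selected = route_top_k(scores, k=2)
output = moe_forward(1.0, experts, scores, k=2)

# Share of parameters active per token, from the figures quoted in the episode.
active_fraction = 42e9 / 1e12

print("experts used:", [i for i, _ in selected])
print(f"active share of parameters: {active_fraction:.1%}")
```

Only two of the four toy experts run for this input, which is the reason compute per token tracks the active parameter count and not the total.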
Justy: Yeah, but those demos aren't nothing. Building a compiler in Rust from scratch is not a stupid benchmark. That's reasoning over a long chain, managing state, iterating on mistakes. Most models can't do that at all. Even if real-world performance is, say, 60 percent of lab numbers, that's still pretty good for the workloads this model is aimed at.
Cody: Fair. I'm not saying the model is bad. I'm saying the benchmarks are task-specific. You're testing it on claw tasks, which are designed to show off coherence and token efficiency. But what if you ask it to reason about ambiguous customer feedback, or synthesize conflicting requirements, or do something that requires real-time fact-checking? That's where you'd probably want Claude or GPT. MiMo is purpose-built.
Justy: Which is fine. Not every model needs to be everything. MiMo is saying, look, we're really good at this specific thing—autonomous agents that run long workflows with minimal hallucination and minimal token spend. That's a real market. There are companies right now that would pay for that.
Cody: Yeah. And the pricing backs it up. The base model is forty cents per million tokens total cost. The Pro model is four dollars. Compare that to GPT-5.4 at seventeen dollars fifty per million, or Claude Opus at thirty dollars. If you're running agent swarms, that's a meaningful discount. But here's the catch: that discount only matters if the model is reliable enough that you're not paying your ops team to debug it constantly.
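The arithmetic behind that comparison, as a rough sketch. The prices and the 40 to 60 percent token reduction are the figures quoted in the episode; the one-million-token task size and the function names are assumptions chosen to keep the numbers easy to read.

```python
# Prices as quoted in the episode, in USD per million tokens (total cost).
PRICE_PER_MTOK = {
    "MiMo-V2.5": 0.40,
    "MiMo-V2.5-Pro": 4.00,
    "GPT-5.4": 17.50,
    "Claude Opus": 30.00,
}

def cost_per_task(model, tokens):
    """Cost in USD of one task that consumes `tokens` tokens on `model`."""
    return PRICE_PER_MTOK[model] * tokens / 1_000_000

def pro_cost_vs(baseline, baseline_tokens, token_savings):
    """Compare a baseline task cost with MiMo-V2.5-Pro using fewer tokens.

    Returns (baseline_cost, pro_cost, baseline_cost / pro_cost).
    """
    base = cost_per_task(baseline, baseline_tokens)
    pro = cost_per_task("MiMo-V2.5-Pro", baseline_tokens * (1 - token_savings))
    return base, pro, base / pro

# Assumption: a task that takes 1,000,000 tokens on the baseline model.
for baseline in ("GPT-5.4", "Claude Opus"):
    for savings in (0.40, 0.60):
        base, pro, ratio = pro_cost_vs(baseline, 1_000_000, savings)
        print(f"{baseline}: ${base:.2f} vs Pro ${pro:.2f} "
              f"at {savings:.0%} fewer tokens ({ratio:.1f}x cheaper)")
```

This covers token spend only. It says nothing about retries, failed runs, or the ops time Cody raises, all of which cut into the gap.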
Justy: So where do you actually see this fitting, then?
Cody: High-volume, low-stakes stuff. Email triage. Content scheduling. Data entry. Workflow automation where the cost of an occasional error is just a retry or a human review. Not financial analysis, not legal documents, not anything customer-facing where a mistake is a PR disaster. And honestly, I think the open-source angle is the real win here. You can grab the weights from Hugging Face, run it on your own hardware, and you're not dependent on Xiaomi's infrastructure or politics
Justy: That's actually the thing that's been bugging me about all the closed-source model wars. Everyone's fighting over whose API is fastest or cheapest. But if you can download the weights and run it yourself, you're not part of that game anymore. You control the infrastructure, you control the data, you're not locked in.
Cody: Right. And for a company that's already comfortable with MLOps, that's huge. You take on ops complexity in exchange for the token savings, but you know what you're getting into. You're not guessing whether Xiaomi's next pricing change or API deprecation will blow up your roadmap.
Justy: Okay, so if someone's actually going to use this, what should they test?
Cody: Two things. First, grab the open-source weights from Hugging Face, spin up a local instance, and run your actual agent workflow on it—not the benchmark workflow, your workflow. See how often it gets stuck, how often it hallucinates, what the end-to-end latency looks like on your infra. Second, if you're seriously considering this for production, run a cost simulation. Take your current token usage, assume 40 percent savings, subtract the ops overhead of self-hosting or the AP
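A minimal sketch of the cost simulation Cody suggests. The 40 percent savings and the per-million prices are the episode's figures, and the overhead reuses the "hundred grand a year" from earlier in the conversation. The monthly volumes are hypothetical, and `new_price` stands in for whatever per-million cost you actually measure, whether that is an API price or amortised hardware.

```python
def monthly_net_savings(current_tokens_m, current_price, new_price,
                        token_savings=0.40, monthly_overhead=0.0):
    """Net monthly savings in USD after switching models.

    current_tokens_m: current usage, in millions of tokens per month
    current_price / new_price: USD per million tokens
    token_savings: fraction of tokens the new model is assumed to save
    monthly_overhead: extra ops cost per month in USD (an assumption you supply)
    """
    current_spend = current_tokens_m * current_price
    new_spend = current_tokens_m * (1 - token_savings) * new_price
    return current_spend - new_spend - monthly_overhead

def break_even_tokens_m(current_price, new_price, token_savings, monthly_overhead):
    """Monthly volume (millions of tokens) at which the switch pays for itself.

    Returns None if the new model is not cheaper per unit of work.
    """
    saved_per_m = current_price - (1 - token_savings) * new_price
    if saved_per_m <= 0:
        return None
    return monthly_overhead / saved_per_m

# Assumed inputs: GPT-5.4 today, MiMo-V2.5-Pro after, one ops hire at $100k/year.
overhead = 100_000 / 12
for volume in (500, 1_000):  # hypothetical millions of tokens per month
    net = monthly_net_savings(volume, 17.50, 4.00, 0.40, overhead)
    print(f"{volume}M tokens/month: net ${net:,.2f}")

print("break-even volume (M tokens/month):",
      break_even_tokens_m(17.50, 4.00, 0.40, overhead))
```

Rerun it with your own measured savings in place of the 40 percent, since that figure comes from the vendor's benchmarks and not from your workflow.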
Justy: And the honest read is: this is not a GPT-5.4 or Claude replacement. It's a specialist. But specialists sometimes win in the market because they're cheaper and they don't need to be good at everything. They just need to be good at their thing.
Cody: [chuckles] Yeah. And that's actually fine. The world doesn't need another all-purpose frontier model. It needs models that are cheap, reliable, and open for the jobs that don't require genius-level reasoning.
Justy: That's Exploring Next, episode 334. Thanks for listening.