Justy: So Thinking Machines just showed off this thing called 'interaction models' and I genuinely don't know if Cody thinks this is genius or if it's the most overhyped architecture flex I'll hear this month.
Cody: It's somewhere in the middle, but let me start skeptical. The entire pitch is basically 'what if AI listened while it talked instead of waiting for you to shut up first.' And yeah, okay—that's technically different. But is it solving a real problem or is it solving a problem that mostly exists because we shipped turn-based models first?
Justy: Right, but here's the thing—a 1.18-second delay on a customer service call is actually brutal. You call a bot, there's that dead air, and it feels broken even though technically the model is 'thinking.' Thinking Machines gets it down to 0.4 seconds.
Cody: Okay, I hear that. But is 0.4 seconds from a native full-duplex model, or is it just... better inference scheduling on a regular model? Because latency optimization has been a thing for years.
Justy: They're processing 200-millisecond chunks of audio and video simultaneously through the same transformer, not routing through separate encoders. So it's not just inference speed—the architecture is actually different.
Cody: Fair.
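To make the 200-millisecond chunking Justy describes a bit more concrete, here is a minimal sketch of grouping audio samples and video frames into shared time windows. It is not Thinking Machines' implementation, which the episode does not detail. The function name `chunk_stream`, the 16 kHz sample rate, and the frame-timestamp input are all assumptions made for illustration.

```python
CHUNK_MS = 200  # chunk length mentioned in the episode


def chunk_stream(audio, sample_rate, frame_times_ms):
    """Group audio samples and video frame indices into shared 200 ms windows.

    audio: sequence of samples; sample_rate: samples per second (assumed);
    frame_times_ms: capture time of each video frame in milliseconds (assumed).
    """
    samples_per_chunk = sample_rate * CHUNK_MS // 1000
    n_chunks = -(-len(audio) // samples_per_chunk)  # ceiling division
    chunks = []
    for i in range(n_chunks):
        start_ms = i * CHUNK_MS
        end_ms = start_ms + CHUNK_MS
        chunks.append({
            "start_ms": start_ms,
            # Audio slice covering this window (the last one may be shorter).
            "audio": audio[i * samples_per_chunk:(i + 1) * samples_per_chunk],
            # Indices of the video frames captured inside this window.
            "frames": [k for k, t in enumerate(frame_times_ms)
                       if start_ms <= t < end_ms],
        })
    return chunks


# Half a second of fake 16 kHz audio and five frames at 10 fps.
chunks = chunk_stream(list(range(8000)), 16000, [0, 100, 200, 300, 400])
```

Each dictionary holds both modalities for the same window, which is the property a single model would need in order to consume them together rather than through separate pipelines.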
Justy: And here's where I think it gets real: they can backchannel while you're still talking. Like, the model says 'mm-hmm' or 'I see' without interrupting you, and then it jumps into the answer. That's not a latency trick—that requires simultaneous input and output.
Cody: That's the genuinely clever bit, I'll admit. But now I'm looking at their benchmark, FD-bench, and it's a benchmark designed to measure 'interaction quality.' Justy, you know how this goes. Every company ships a benchmark that makes them look good.
Justy: True, but they also ran it against Gemini and GPT-realtime and crushed them. And the visual stuff, RepCount-A, where the model counts repetitions in a video in real time, that's not a trick. Either it can do it or it can't.
Cody: Okay, I'm not dismissing the results. 77.8 on FD-bench versus 46.8 for GPT-realtime-2.0 is a real gap. The visual proactivity is legitimately stronger. What I'm questioning is whether this is a 'new class of model' or just the natural next step after we stopped being lazy about inference.
Justy: Maybe both. But here's what matters to me—the enterprise play. In a manufacturing plant, you've got a worker on the floor and a camera watching them. Today, an AI model watches and waits for someone to ask, 'Is this safe?' With Thinking Machines, the model can just... interrupt when it sees a violation. That's not a latency optimization—that's a different interaction paradigm.
Cody: Mm-hm.
Justy: And for customer service, when someone's frustrated, a bot that backchannels naturally instead of leaving dead air while it processes actually changes how people perceive the interaction.
Cody: I buy that. The use case is real. My concern is different—they've got a research preview, no general release yet, no word on pricing, and it's unclear if they're going to open-source any of this. So it's impressive technology on a roadmap, not a product you can ship with today.
Justy: Totally fair. But they mentioned they're 'committed to significant open-source components,' which—if that's real—changes the game. You'd have a reference architecture for native multimodal interaction.
Cody: That's the question mark. Tinker, their fine-tuning thing, didn't light the world on fire adoption-wise. So even if they open-source the interaction model, will anyone actually build on it?
Justy: You're saying they could hand out free genius and we'd all just... not use it?
Cody: Basically, yeah. Integration friction is real. But if someone does build a customer service bot or a manufacturing safety system on this, the latency and visual reaction time would be noticeably better than what we have now.
Justy: So what does Build Next look like? A weekend project to test this?
Cody: Two angles. First: the moment they open the research preview, someone should wire up a live customer service bot—measure latency, backchannel naturalness, customer satisfaction compared to a standard turn-based system. Real metrics, not benchmarks.
Justy: Right.
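For the latency part of Cody's first angle, a rough sketch of the measurement might look like this. It assumes you can log two timestamps per turn, when the caller stops speaking and when the bot starts speaking. The names `response_latencies` and `summarize` are hypothetical, and the sample numbers are invented purely to exercise the code.

```python
import math
from statistics import median


def response_latencies(turns):
    """turns: list of (user_stopped_at, bot_started_at) timestamps in seconds."""
    return [bot_started - user_stopped for user_stopped, bot_started in turns]


def summarize(latencies):
    """Return median, nearest-rank 95th percentile, and worst-case latency."""
    ordered = sorted(latencies)
    rank = max(1, math.ceil(0.95 * len(ordered)))  # nearest-rank percentile
    return {
        "median": median(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }


# Made-up timestamps for three turns of a test call.
turns = [(0.0, 0.4), (5.0, 5.5), (10.0, 11.0)]
stats = summarize(response_latencies(turns))
```

Reporting the tail alongside the median matters here, since a handful of long pauses is what makes a call feel broken even when the typical turn is fast.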
Cody: Second: if they do release the model weights or an API, build a safety detection system in a simulated manufacturing environment. Camera feed, real-time interruption when something goes wrong. That's where this either proves itself or doesn't.
Justy: And the solo builder angle?
Cody: If you want to play with this right now without waiting, you could prototype a dual-model system yourself—run a fast inference model for chat and a heavier async model for complex reasoning. Not full-duplex, but you'd learn where the friction actually is.
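A minimal sketch of the dual-model prototype Cody describes, using Python's `asyncio`. Both model functions are hypothetical stubs that only sleep and echo the prompt; in a real test you would replace them with calls to whatever fast and heavy models you have access to. As Cody says, this is not full-duplex, just two models running concurrently.

```python
import asyncio


# Hypothetical stand-ins for real model calls; swap in your own clients.
async def fast_model(prompt: str) -> str:
    await asyncio.sleep(0.01)  # simulated low-latency chat model
    return f"quick: {prompt}"


async def heavy_model(prompt: str) -> str:
    await asyncio.sleep(0.05)  # simulated slower reasoning model
    return f"deep: {prompt}"


async def dual_respond(prompt: str) -> list[str]:
    """Return replies in the order a user would receive them."""
    events: list[str] = []
    # Start the heavy model in the background before answering.
    heavy_task = asyncio.create_task(heavy_model(prompt))
    events.append(await fast_model(prompt))  # immediate acknowledgement
    events.append(await heavy_task)          # follow-up once reasoning finishes
    return events


def run_demo(prompt: str) -> list[str]:
    return asyncio.run(dual_respond(prompt))
```

Calling `run_demo("reset my password")` returns the quick reply first and the deeper one second. The friction Cody mentions shows up as soon as you try to decide what the fast model is allowed to say before the heavy one has finished.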
Justy: Fair. So the verdict—is this real or is it marketing?
Cody: It's real technology. Whether it's a paradigm shift or just really good engineering is the question I can't answer yet. The benchmarks are solid, the architecture makes sense, but I won't believe it until I see adoption.
Justy: I'm more bullish. The latency is real, the visual proactivity is real, and the enterprise problems are real. If they price it right and open-source parts of it, this could actually reshape how people interact with AI at work. We'll see in a few months when the research preview lands.