Justy: If this thing can really stay useful for twelve hours, that is a way bigger deal than another model winning trivia night.
Cody: [chuckles] Also way ruder to the rest of us, because now your backlog can judge you all weekend.
Justy: I flew into DC, my body still thinks it's breakfast, and Cody handed me coffee like we were about to record a tiny radio show for six very specific people. Which... fair. But yeah, this one matters.
Cody: You say tiny like that's a problem. [laughs] Meanwhile you're in a light jacket acting betrayed by actual spring weather.
Justy: Welcome back to Exploring Next, episode 306. I'm Justy, here with Cody in DC, and today we're talking about Moonshot's Kimi K2.6, which looks important because people need models that finish work, not just impress in a demo tab.
Cody: Right, and the headline isn't just benchmark chest-thumping. K2.6 is posting strong numbers on agentic coding and tool use, which is where teams actually burn money. SWE-Bench Pro at 58.6, Toolathlon at 50.0, DeepSearchQA at 92.5, and Humanity's Last Exam with tools at 54.0. That's right in the mix with GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro.
Justy: And that matters because if I'm a product team, I don't really care that a model solved some exam question a little better. I care whether it can open files, run tests, keep context, and not wander off like me in an airport bookstore.
Cody: Exactly. And Moonshot is very clearly aiming at execution depth. They even admit K2.6 doesn't lead on pure reasoning stuff like AIME 2026 or GPQA Diamond. So the pitch is basically: maybe we're not the class valedictorian, but we will actually stay in the room and do the group project.
Justy: [giggles] Which, honestly, is the most enterprise sentence imaginable. Cody, break down the part that jumped out at me: four thousand tool calls over twelve hours. Because that is either very real progress or the longest unpaid internship in history.
Cody: The interesting bit is the endurance. One demo had K2.6 optimizing local inference for Qwen3.5-0.8B on a Mac using Zig, which is not exactly the training-wheels language. It kept going for 4,000-plus tool calls over more than 12 hours and reportedly beat LM Studio's throughput by about 20%. Another run refactored exchange-core, an old open-source matching engine, over 13 hours and 12 optimization passes, changing more than 4,000 lines and boosting medium throughput 185%, peak
Justy: That's the part I keep coming back to. Not the leaderboard screenshot, the stamina. Because in the real world, code work is annoying. It's grep, rerun, fail, retry, open another file, realize past-you left a weird comment, keep going...
Cody: ...and not lose the thread. That's the whole game. Plenty of models can look brilliant for ten minutes. Long-horizon work is where they start acting like somebody who promised to assemble a bookshelf and is suddenly just sitting on the floor reading the screws packet.
Justy: [laughs] That is Cody's entire weekend list, by the way. Every chore is apparently entering research preview.
Cody: Unfair. [sighs] The list is healthy. The point is, K2.6 seems optimized for sustained coordination. Moonshot says its swarm setup now supports 300 parallel sub-agents and 4,000 coordinated steps, up from 100 and 1,500 in K2.5. So instead of one monolithic model juggling everything, you split the work into specialized threads and have a coordinator keep them aligned.
Justy: Which sounds great until you're the person paying for all those threads and trying to figure out which one quietly made the weird decision. That's my product-brain hesitation. More agents can mean more output, but also more places for the run to drift.
Cody: Yeah, that's the real trade-off. Parallelism buys speed and specialization, but orchestration quality becomes the product. Routing, memory, failure recovery, tool permissions, evaluation... if any of that is sloppy, 300 sub-agents just means you can be wrong at industrial scale. [pause] The reason I'm taking this seriously is that Moonshot is pairing the architecture claim with actual long-running demos and partner validation from places like Vercel, Baseten, Ollama, and Fact
Justy: The Vercel bit stood out. More than 50% improvement on its Next.js benchmark over K2.5 suggests this isn't just a cosmetic point release. And from a user-story angle, I can see the buyer: teams that want an open model they can wire into coding workflows, internal tools, maybe documentation pipelines, without waiting around for closed vendors to expose every knob.
Cody: And then there's Claw Groups, which is the weirder, more forward-looking piece. Research preview only, but the idea is K2.6 acting as coordinator for a mixed team of agents from different devices and even different models, plus humans in the loop. That's interesting because it assumes the future isn't one giant model doing everything. It's more like a messy office where somebody competent keeps handing tasks to the right people and catches failures before they spread.
Justy: [exhales] That is either the future of work or the fastest way to create an AI version of a group chat where nobody knows who was supposed to bring the slides. My wife asked what we were recording and I said, basically, we're discussing whether software can become middle management. I got a look.
Cody: [chuckles] Fair look. But that's also why this matters now. The market has moved past single-answer chat being enough. If you're evaluating models in 2026, you need to test whether they can produce a real deliverable in one run: a document, a site, a spreadsheet, a code change. K2.6 is making a serious bid there.
Justy: I don't even have a big cynical take here. I'd give the release a B-plus, maybe an A-minus if the real-world replication keeps landing. Yes, I will grade a model launch, and yes, Cody will act like that's unreasonable.
Cody: You'd grade a sunset if it had a product page, Justy. But yeah, the honest take is simple: don't overread a launch post, and don't ignore this either. If a model is strong on the boring, expensive, tool-heavy work, people will use it.
Justy: So if you're trying this out this week, run Kimi through the API or Kimi Code on a bounded repo task. Give it something real like test repair, a refactor, or profiling a slow path. Keep the task scoped enough that you can actually judge whether it stayed coherent.
Cody: Then run the same task with OpenHands or aider, side by side with another strong coding model. Same repo, same tool budget, same success criteria. And if you want the swarm angle, build a tiny pipeline where separate agents research, code, and write release notes, then force one coordinator to merge the output into something you'd actually ship.
Justy: And maybe don't start with 300 agents unless you enjoy turning your laptop into a space heater. That's our trip through Kimi K2.6. Cody's got chores he's still calling a roadmap, I've got a return flight later, and now I kind of want to see who can survive the red-eye rule: me or this model.