Justy: So apparently we're at the point where a three-billion-parameter model can call your API, format the JSON, and not embarrass itself.
Cody: Yeah, I saw that KDnuggets piece. Five small models, all with structured tool calling, all open weights.
Justy: This is basically the thing you keep saying was impossible.
Cody: I did NOT say impossible. I said the gap was still wide. There's a difference.
Justy: Mm-hm.
Justy: Anyway, you got in late? You look destroyed.
Cody: Redeye from the west coast. I am running on whatever this airport coffee is.
Justy: Pathetic. I actually slept eight hours. First time in like three weeks.
Cody: Okay, showoff. But yeah, the article. The claim is that agentic AI lives or dies on tool calling, and these five small models are finally closing the gap with the big frontier ones.
Justy: Right, right.
Cody: They start with SmolLM3-3B. Hugging Face, three billion parameters, decoder-only with Grouped Query Attention and NoPE, which drops the rotary position embeddings from some of the layers. Sixty-four K native context, up to one twenty-eight with YaRN extrapolation. Trained on eleven point two trillion tokens, post-trained with something called Anchored Preference Optimization.
Justy: APO, yeah. I saw that paper.
Cody: The interesting bit is the dual tool interfaces. XML blobs through xml_tools and Python-style function calls through python_tools. That's unusually flexible for a model this small.
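A rough sketch of what consuming the two interfaces could look like on the application side. The output shapes here are assumptions for illustration: a JSON object wrapped in `<tool_call>` tags for the XML-style mode, and a Python call wrapped in `<code>` tags for the Python-style mode. The tool name `get_weather` and both parser functions are hypothetical; check the model card for the exact formats before relying on this.

```python
import ast
import json
import re


def parse_xml_tool_call(text):
    """Parse an assumed <tool_call>{...}</tool_call> block into a dict."""
    match = re.search(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL)
    if match is None:
        return None
    try:
        call = json.loads(match.group(1).strip())
    except json.JSONDecodeError:
        return None
    return {"name": call.get("name"), "arguments": call.get("arguments", {})}


def parse_python_tool_call(text):
    """Parse an assumed <code>func(kw=value)</code> block into a dict.

    Only keyword arguments with literal values are handled; positional
    arguments are ignored in this sketch.
    """
    match = re.search(r"<code>(.*?)</code>", text, re.DOTALL)
    if match is None:
        return None
    try:
        node = ast.parse(match.group(1).strip(), mode="eval").body
    except SyntaxError:
        return None
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        return None
    arguments = {
        kw.arg: ast.literal_eval(kw.value)
        for kw in node.keywords
        if kw.arg is not None
    }
    return {"name": node.func.id, "arguments": arguments}


# Hypothetical model outputs in each style.
xml_output = '<tool_call>{"name": "get_weather", "arguments": {"city": "Lisbon"}}</tool_call>'
python_output = '<code>get_weather(city="Lisbon")</code>'
```

Both parsers normalize to the same `{"name", "arguments"}` shape, so the code that dispatches to the real tool does not need to know which interface the model was prompted with.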
Justy: Okay, but who is reaching for a three B model when they could just call GPT-4?
Cody: If you're running on an edge device or a machine with eight gigs of VRAM, you literally cannot run the big stuff. Plus, Apache two, fully open, weights and training code. SmolLM3 is built for people who need to ship without sending everything to an API.
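The eight-gig argument is easy to check with back-of-envelope arithmetic. This sketch counts weights only and ignores activations, the KV cache, and runtime overhead; the seventy-billion-parameter comparison is a hypothetical stand-in for "the big stuff", not a model named in the article.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


small_bf16 = weight_memory_gib(3e9, 16)   # about 5.6 GiB
small_int4 = weight_memory_gib(3e9, 4)    # about 1.4 GiB
large_bf16 = weight_memory_gib(70e9, 16)  # hypothetical 70B model, far above 8 GiB

print(f"3B @ 16-bit: {small_bf16:.1f} GiB")
print(f"3B @ 4-bit:  {small_int4:.1f} GiB")
print(f"70B @ 16-bit: {large_bf16:.1f} GiB")
```

At sixteen bits the three-billion-parameter weights fit under eight gigabytes with a little room to spare, and quantizing to four bits leaves most of the card free for context.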
Justy: Fair. I do love a model I can actually host.
Cody: Then there's Qwen3-4B-Instruct from Alibaba. Four billion parameters, but three point six excluding embeddings. Thirty-six layers, GQA with thirty-two query heads and eight KV heads. The big number is two hundred sixty-two thousand tokens of native context.
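Those head counts matter most for the KV cache. A minimal estimate, using the thirty-six layers and eight KV heads from the discussion, plus two assumptions that are not stated in it: a head dimension of 128 and a two-byte (16-bit) cache.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Size of the key and value caches for one sequence."""
    # Factor of 2 covers keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


HEAD_DIM = 128        # assumption, not stated in the discussion
FULL_CONTEXT = 262_144

gqa = kv_cache_bytes(36, 8, HEAD_DIM, FULL_CONTEXT)
# Same model if every one of the 32 query heads kept its own KV head.
mha = kv_cache_bytes(36, 32, HEAD_DIM, FULL_CONTEXT)

print(f"GQA (8 KV heads):  {gqa / 2**30:.0f} GiB")
print(f"MHA (32 KV heads): {mha / 2**30:.0f} GiB")
```

Under these assumptions, sharing KV heads cuts the cache by a factor of four, but filling the whole native window still takes tens of gigabytes, so the advertised context length and the context you can afford on a small GPU are different numbers.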
Justy: That's absurd for four billion.
Cody: Right? And it's non-thinking only, so optimized for fast responses. Hundred-plus languages, native tool calling through Qwen-Agent and MCP.
Justy: Wait, MCP? As in the Anthropic protocol?
Cody: Yeah, they adopted the Model Context Protocol. That's actually a big signal, because if a Chinese lab is building around Anthropic's open standard, that standard is winning.
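For context on what adopting MCP means at the wire level: the protocol is JSON-RPC 2.0, and invoking a tool is a `tools/call` request carrying the tool's name and arguments. A minimal sketch of building that message follows; the helper name and the `get_weather` tool are hypothetical, and transport, initialization, and error handling are left out.

```python
import json


def make_tools_call_request(request_id, tool_name, arguments):
    """Build an MCP tools/call request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical tool and arguments.
request = make_tools_call_request(1, "get_weather", {"city": "Lisbon"})
print(json.dumps(request, indent=2))
```

The model's job is only to produce the name and the arguments; a client library wraps them in this envelope and sends them to the server.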
Justy: Okay, so the article's central argument is that these aren't just toy demos. They're actually viable for production agentic pipelines.
Cody: That's the claim. My read? It's directionally true, but the article's 'first-class' label is doing more work than the author admits. Frontier models still win on complex multi-hop reasoning where tools depend on each other. These small models are great for single-tool or shallow chains.
Justy: Which is like eighty percent of real use cases, though.
Cody: Probably, yeah.
Justy: For product teams, the question is always latency and burn. If I can run this on a cheap GPU instead of paying per-token to OpenAI, that's not nothing.
Cody: It's definitely not nothing. I just don't want people thinking the gap is gone. It's shrinking, not closed.
Justy: Noted. You'll say 'I told you so' when someone's three B agent loops forever.
Cody: I absolutely will.
Cody: The thing the article doesn't dig into enough is the evaluation. They list specs but don't show head-to-head success rates on real tool-use benchmarks. I'd love to see how SmolLM3's xml_tools mode actually performs against Claude on something like BFCL or ToolBench.
Justy: So your take is: exciting, directionally correct, but bring your own benchmarks before you ship.
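"Bring your own benchmarks" can start very small. The sketch below scores a model on exact-match tool calls, which is much cruder than what suites like BFCL do. Everything in it is hypothetical: the test cases, the `stub_model` stand-in, and the assumption that the model's output has already been parsed into a `{"name", "arguments"}` dict.

```python
def tool_call_matches(predicted, expected):
    """True when the predicted call has the expected name and arguments."""
    if predicted is None:
        return False
    return (
        predicted.get("name") == expected["name"]
        and predicted.get("arguments") == expected["arguments"]
    )


def success_rate(cases, call_model):
    """Fraction of cases where call_model(prompt) matches the expected call."""
    if not cases:
        return 0.0
    hits = sum(
        tool_call_matches(call_model(case["prompt"]), case["expected"])
        for case in cases
    )
    return hits / len(cases)


# Hypothetical test cases drawn from your own product's tools.
CASES = [
    {"prompt": "Weather in Lisbon?",
     "expected": {"name": "get_weather", "arguments": {"city": "Lisbon"}}},
    {"prompt": "Weather in Oslo?",
     "expected": {"name": "get_weather", "arguments": {"city": "Oslo"}}},
]


def stub_model(prompt):
    """Stand-in for a real model call; always answers with Lisbon."""
    return {"name": "get_weather", "arguments": {"city": "Lisbon"}}


rate = success_rate(CASES, stub_model)
print(f"success rate: {rate:.0%}")
```

Swapping `stub_model` for a function that calls each candidate model gives a like-for-like number on the calls your product actually makes, which is the comparison the article leaves out.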
Cody: Exactly. And if you're already in the Qwen ecosystem, the MCP support is genuinely convenient. If you're in Hugging Face land, SmolLM3 is probably the easiest on-ramp.
Justy: Good enough for me. Go sleep, Cody.