Justy: Okay, Cody, this one got me fast. The real point is not "agents need evals." It’s that the old benchmark mindset barely tells you anything once a model can loop, use tools, and keep going on its own.
Cody: Yeah.
Cody: And I think that mostly holds. If the system can act over time, then a tidy question-answer score is almost decorative. It says something about the model, but not much about the agent you actually shipped.
Justy: Also, I’m running on weird sleep because I got in late and then spent this morning fighting my calendar app like it was a personal enemy. Which honestly made me extra receptive to "please do not let an agent touch real workflows unless you can measure it."
Cody: That is fair. I lost an hour today to a local dev setup that broke because one package decided its own versioning was a creative writing exercise. So yes, anyway, evaluating systems that can keep making decisions after the first mistake feels very real.
Justy: The article’s clearest move, I think, is defining pretty narrowly what separates an agent from a plain model. An agent is basically an LLM using tools inside a loop, with enough autonomy to judge intermediate results and recover from errors instead of just answering once.
Cody: Right, right.
Justy: That sounds simple, but it matters because it shifts what counts as failure. Not just wrong answer. Wrong tool, bad tool args, unnecessary steps, getting stuck, or quietly doing something dumb in the environment.
Cody: The Qwen3 example helps there. The author walks through the XML-style tool tokens, like tool, tool call, and tool response, and the actual inference loop where generation stops, the call gets parsed, the tool runs, then generation resumes with the result in context.
Justy: Mm-hm.
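[Editor's sketch] The loop Cody describes can be mocked up in a few lines of Python. This is a minimal illustration, not the article's code and not Qwen3's actual runtime. It assumes tool calls arrive as JSON wrapped in `<tool_call>` tags and results are fed back inside `<tool_response>` tags. The names `run_agent`, `scripted_model`, and the `add` tool are hypothetical, and `generate` stands in for a real model call.

```python
import json
import re

# Assumed format: a JSON object wrapped in <tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def run_agent(generate, tools, prompt, max_steps=5):
    """Generate, stop at a tool call, run the tool, resume with the result."""
    context = prompt
    trace = []  # (tool name, arguments, result), kept for later scoring
    for _ in range(max_steps):
        text = generate(context)
        context += text
        match = TOOL_CALL_RE.search(text)
        if match is None:
            return text, trace  # no tool call: treat the text as the final answer
        call = json.loads(match.group(1))
        result = tools[call["name"]](**call["arguments"])
        trace.append((call["name"], call["arguments"], result))
        # Put the tool result back in context so generation can resume.
        context += f"\n<tool_response>\n{json.dumps(result)}\n</tool_response>\n"
    return None, trace  # step budget exhausted: the agent got stuck


def scripted_model(context):
    """Hypothetical stand-in for a model: call `add` once, then answer."""
    if "<tool_response>" in context:
        return "The sum is 5."
    return '<tool_call>\n{"name": "add", "arguments": {"a": 2, "b": 3}}\n</tool_call>'


answer, trace = run_agent(scripted_model, {"add": lambda a, b: a + b}, "What is 2 + 3?")
```

The returned `trace` is the part an eval cares about: it records every call the agent made, not just the final text.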
Cody: That’s useful because it makes the eval problem concrete. You can score whether the model called a tool when it should have, picked the right one, produced valid structure, followed a sensible trajectory, or just reached the right outcome.
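[Editor's sketch] The first three checks Cody lists (did it call a tool, did it pick the right one, is the structure valid) can be scored separately for each step. The snippet below is one possible shape for that, under loud assumptions: `check_tool_call`, the `SCHEMAS` table, and the `get_weather` tool are all hypothetical, and the "schema" is just a map from argument name to Python type rather than a real JSON Schema.

```python
# Hypothetical tool registry: argument name -> expected Python type.
SCHEMAS = {"get_weather": {"city": str}}


def check_tool_call(call, expected_tool, schemas):
    """Return separate pass/fail checks for one step instead of a single score."""
    checks = {"called_tool": call is not None, "right_tool": False, "valid_args": False}
    if call is None:
        return checks
    checks["right_tool"] = call.get("name") == expected_tool
    schema = schemas.get(call.get("name"))
    args = call.get("arguments")
    if schema is not None and isinstance(args, dict):
        # Valid means: exactly the declared argument names, each of the declared type.
        checks["valid_args"] = set(args) == set(schema) and all(
            isinstance(args[name], kind) for name, kind in schema.items()
        )
    return checks


good = check_tool_call(
    {"name": "get_weather", "arguments": {"city": "Oslo"}}, "get_weather", SCHEMAS
)
bad_args = check_tool_call(
    {"name": "get_weather", "arguments": {"town": "Oslo"}}, "get_weather", SCHEMAS
)
no_call = check_tool_call(None, "get_weather", SCHEMAS)
```

Keeping the checks as separate fields means a run that picked the right tool but sent bad arguments shows up differently from one that never called a tool at all.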
Justy: And I liked that he doesn’t treat tool use as this magical layer. It’s almost annoyingly mundane. Tokens, schemas, docs, retries, error handling. Which is very much your love language, by the way.
Cody: I mean, yes. The line about asking whether a human engineer could use the tool from the docs is probably the most grounded sentence in the whole piece. Bad tools poison the eval, because then you can’t tell whether the agent failed or your interface was just mush.
Justy: That part changed the practical read for me. If a team is shipping a coding agent, or some workflow agent that can change tickets or schedules or whatever, they should care NOW. A slick demo is not evidence that the thing is reliable across long tasks.
Cody: Exactly.
Cody: Where I’d push a little is trajectory scoring. It’s helpful, but it can get weird fast. In real tasks there may be several valid paths, so a single ground-truth sequence can punish an agent for solving the problem differently, or even better.
Justy: Yeah, that felt like the main place where the framework can overreach. If you score the path too rigidly, you end up rewarding obedience to your harness instead of competence.
Cody: Oh interesting.
Cody: Right. Outcome-only evals have the opposite problem, though. The agent can stumble into a correct answer after wasting tools, taking risky actions, or relying on brittle behavior. So his broader argument still works. You need multiple views, not one score pretending to be truth.
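[Editor's sketch] One way to picture "multiple views, not one score" is a report with several independent fields. The example below is a rough illustration, assuming a run is summarized as a list of tool names plus a final answer. The function `score_run`, the tool names, and the step budget are hypothetical. Accepting any of several reference paths is one possible way to soften the rigid-trajectory problem raised above, not something the article prescribes.

```python
def score_run(tool_names, final_answer, expected_answer,
              accepted_paths, risky_tools, step_budget):
    """Report several views of one run rather than collapsing them into one number."""
    return {
        "outcome_ok": final_answer == expected_answer,
        # Any of several reference sequences counts, not one fixed ground truth.
        "path_accepted": tool_names in accepted_paths,
        "no_risky_calls": not any(name in risky_tools for name in tool_names),
        "within_budget": len(tool_names) <= step_budget,
    }


accepted = [["search", "open_ticket"], ["lookup_user", "open_ticket"]]
risky = {"delete_ticket"}

clean = score_run(["lookup_user", "open_ticket"], "ticket opened", "ticket opened",
                  accepted, risky, step_budget=3)
lucky = score_run(["search", "search", "delete_ticket", "open_ticket"],
                  "ticket opened", "ticket opened", accepted, risky, step_budget=3)
```

Both runs pass an outcome-only eval. Only the other three fields show that the second one got there by wasting steps and making a risky call.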
Justy: This is such an episode 415 problem. We invented a machine that can kind of operate software, and now we’re like, cool, how do we grade its judgment without lying to ourselves.
Cody: And without building a fake little school for robots.
Justy: Which, to be clear, is exactly what eval harnesses are. Tiny weird obstacle courses. Very serious industry, deeply normal behavior.
Cody: I do think the article is strongest when it says stop relying on anecdotal checks. That’s the part people need to hear. If the agent is headed for coding, medicine, or any workflow with real consequences, vibes are NOT an eval.
Justy: And for everybody else, I think the practical takeaway is smaller. If you’re just using an LLM for narrow text tasks, this probably doesn’t change your week. But if the product story involves autonomy, tool use, and "it can handle the whole thing," then yeah, Cody, you need a harness before you need a launch post.
Cody: That is annoyingly correct, Justy.
Justy: Great. Let’s end there before you make me build a rubric for my calendar app.