Justy: If your coding agent writes twenty files and you still have to hover over it like it's assembling furniture without the manual... then yeah, we have a problem.
Cody: [chuckles] And somehow the demo always ends right before the part where you open the diff and your shoulders drop.
Justy: I'm in DC this week, I got off a very average flight, and Cody handed me coffee like we were about to record for twelve people who deeply care about static analysis.
Cody: That is our exact lane. Also you flew cross-country and still packed like a man who thought spring in DC meant July in LA.
Justy: [laughs] Welcome back to Exploring Next, episode 305. I'm Justy, here with Cody, and today we're talking about harness engineering for coding agent users... which sounds abstract, but it's really about whether AI code can earn any trust at all.
Cody: Yeah. This matters right now because people are actually using these agents on real codebases, not toy apps. And the bottleneck isn't generation anymore, it's supervision. If every AI change still needs full human babysitting, the speed boost kind of evaporates.
Justy: Right, because from a product angle the promise isn't just faster typing. It's lower review toil, fewer dumb loops, fewer tokens burned on retries, and maybe getting to a place where the agent can handle bounded work without somebody staring at every line.
Cody: [exhales] The article's useful move is narrowing what "harness" means. Not everything around the model in the universe, but the outer layer you, the user or team, build for your codebase. Instructions, retrieval, tests, checks, custom rules, all the stuff that shapes and judges the agent's work.
Justy: And the split I liked was guides versus sensors. Guides try to steer before the model acts. Sensors watch what happened after and feed back signals so it can fix itself. That's way more practical than just saying, "well, the model got better this month."
Cody: Exactly. And then she slices those again into computational and inferential. Computational is your CPU-side, deterministic stuff: tests, linters, type checks, structural analysis. Fast, cheap, repeatable. Inferential is semantic judgment, AI code review, LLM-as-judge. Richer... but slower, pricier, and a little slippery.
Justy: Which is kind of the whole thing, right? You want the cheap alarm that goes off every time, and then the expensive smart friend who occasionally notices, "Hey, this technically works but it's weird." [giggles]
Cody: Yes, and the article makes a subtle point there. Inferential sensors are most useful when they produce feedback the model can actually act on. So not vague scolding. More like custom linter messages or review comments phrased so the agent can self-correct on the next pass.
Justy: Positive prompt injection, basically. Which sounds sinister, but in practice it's just better coaching. And that's a real user story: teams already have conventions trapped in somebody's head or buried in old docs. The harness turns those into something the agent can consume every single time.
Cody: [pause] Also, this is where timing matters. Fast computational sensors should sit way left, before commit if possible. Pre-commit hooks, quick tests, type checks. Then heavier things later in CI: broader review, mutation testing, maybe a more semantic pass that looks at the larger change. You don't want to spend premium judgment on code that already failed the easy stuff.
Justy: Cody, you are in documentary narrator mode again. Somewhere a penguin is learning about pre-commit hooks.
Cody: [laughs] In the wild, the junior agent approaches the type checker... no, but seriously, that's the architecture choice. Put deterministic controls beside the agent all the time, and use inferential controls where the extra cost buys you something.
Justy: Midway tangent, your weekend list still alive? Because this article is basically telling teams to keep adding chores forever. New harness rule, new sensor, new drift check... I'm like, this is your apartment whiteboard in software form.
Cody: [sighs] The list is under control. But yes, the human's job becomes steering the harness. If the agent repeats a mistake, don't just fix the diff again. Improve the guide or sensor so the same mistake gets less likely next time.
Justy: That part I buy completely. It shifts the human from line-by-line hall monitor to system tuner. Very PM-friendly too, because once a team sees recurring failure modes, they can codify them. Architecture boundaries, logging standards, bootstrap steps, even code mods with something like OpenRewrite.
Cody: And the article groups the harness into maintainability, architecture fitness, and behavior. Maintainability is the easy win because tooling already exists. Duplicate code, complexity, style drift, missing coverage... machines are good at that. Architecture fitness is things like module boundaries or performance expectations. Behavior is the hard one, because if the spec is fuzzy, no sensor can reliably rescue that.
Justy: Yeah, that's the honest limit. If the human asked for the wrong thing, you don't get correctness by sprinkling judges on top. Green tests can still mean the agent wrote tests for its own misunderstanding, which is like grading your own airport sandwich and giving it an A-minus because you were hungry.
Cody: [chuckles] You would grade a sandwich. But that's the strongest caution in the piece: these harnesses absolutely raise trust for structure and maintainability, somewhat for semantics, and much less for misunderstood intent or unnecessary features. So autonomy goes up in slices, not all at once.
Justy: So if somebody wants to get hands-on after this, I'd start small. Add an AGENTS.md or equivalent instructions file with your repo conventions. Then wire a pre-commit run for tests, lint, and type checks that the agent sees immediately. Give it the bumpers before you ask for fancy moves.
Cody: And build one custom structural check. ArchUnit if you're on the JVM, or a lightweight script that enforces folder boundaries and import rules. Then try custom linter output with actionable fix text. If you're curious about code mods, go read up on OpenRewrite recipes and have the agent help draft one.
Justy: I'm adding that to the list. Not your list, my list. Anyway, that's Exploring Next. If you hear a faint sound after we stop, it's Cody opening another note titled "weekend" and pretending this time he means it.