Justy: The part that hits me is how many teams already have traces everywhere, and still don't know what to fix on Monday.
Cody: Yeah. The raw data problem got solved before the triage problem. So this launch matters because it's trying to turn production mess into a ranked work queue instead of a guilt pile.
Justy: Right, and that's a very real product pain. You can have an agent in production, people are clearly having a weird time with it, and the team is still stuck reading individual traces like it's forensic work.
Cody: Mm-hm.
Cody: What LangSmith Engine seems to do is run that whole loop continuously. It watches production traces, groups repeated failures into named issues, checks signals like explicit tool errors, evaluator failures, anomalies, bad user feedback, even cases where people ask for stuff the agent was never built to handle.
Justy: I got in late last night and made the mistake of coffee at, like, nine. So my brain is somehow tired and overclocked. Anyway, that actually feels relevant, because a lot of agent teams are operating exactly like that.
Cody: Exactly. And the clever bit is it doesn't stop at, hey, something failed. If your repo is connected, it tries to diagnose the root cause against the code or prompt setup, then drafts a PR and also proposes eval coverage so the same class of failure gets watched going forward.
Justy: That eval part is maybe the strongest product story, honestly. Not just fixing the bad thing, but turning the bad thing that escaped into a test case so it stops sneaking back in after the next deploy.
Cody: Right, right.
Cody: Their example was pretty concrete. Support bot, users ask about canceling a subscription, the bot handles it badly, online evals mark it as a failure, user feedback is negative, but latency is normal so no systems alert goes off. Engine surfaces one issue with a severity, says it hit 12 percent of support sessions that week, notes it started four days ago and lines up with a deployment.
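A rough sketch of the kind of issue summary Cody describes here: share of sessions affected, when the failures started, and whether that lines up with a deployment. This is not how Engine computes it. The `Trace` record, the `summarize_issue` helper, and the one-day matching window are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Trace:
    # Hypothetical record shape, not a LangSmith type.
    session_id: str
    started_at: datetime
    failed: bool  # e.g. an online eval failed or user feedback was negative


def summarize_issue(
    traces: list[Trace],
    deployments: list[datetime],
    window: timedelta = timedelta(days=1),  # assumed matching window
) -> Optional[dict]:
    """Summarize one failure cluster against all traces in the period."""
    sessions = {t.session_id for t in traces}
    failing = [t for t in traces if t.failed]
    if not sessions or not failing:
        return None

    failing_sessions = {t.session_id for t in failing}
    first_seen = min(t.started_at for t in failing)

    # A deployment "lines up" if it happened shortly before the first failure.
    prior = [d for d in deployments if timedelta(0) <= first_seen - d <= window]

    return {
        "share_of_sessions": len(failing_sessions) / len(sessions),
        "first_seen": first_seen,
        "likely_deployment": max(prior) if prior else None,
    }
```

The point of the sketch is the shape of the output: one record per issue with a rate and a timeline, rather than a pile of individual traces.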
Justy: And then it claims it can read the code and figure out the cancellation tool description is ambiguous. Which, I mean, I can believe sometimes. That's also the spot where trust gets won or lost.
Cody: Sure.
Cody: Yeah, same. I buy the clustering more easily than I buy perfect diagnosis. But even a decent draft is useful if the alternative is a human spending two hours spelunking through long traces and then writing the same eval by hand. Their own framing is review-and-merge, not blind autopilot, which feels sane.
Justy: Who I think buys this first is not the person hacking on a toy chatbot over a weekend. It's the team that already has LangSmith tracing, probably some online evaluators, a repo with enough structure, and enough support or ops volume that repeated failures actually matter in dollars or churn.
Cody: Yeah.
Cody: Because it's built on top of existing LangSmith plumbing. They keep saying no new infrastructure, which is true only if you're already in their world. Connect a tracing project, optionally connect the repo, and it starts surfacing issues. If you're not already tracing and evaluating, the barrier is still the setup discipline.
Justy: That's the adoption wrinkle. The product pitch sounds simple, but the prerequisite is you already believe in observability for agents. A surprising number of teams still ship prompt changes with basically vibes and screenshots.
Justy: You know I'm right, Cody.
Cody: No, fully. The phrase is harsh, but fair. Also, this is why the online-plus-offline combo matters. Engine uses existing evaluator results as input for issue detection, and when it finds a gap, it proposes a custom online evaluator for that exact issue and pulls failing traces into an offline dataset with per-example criteria.
Justy: That closes a loop a lot of teams never close. Production becomes the source of the next eval suite, instead of this separate spreadsheet project somebody keeps postponing after standup.
Cody: Oh interesting.
Cody: And architecturally, I like that it's a deep agent sitting above traces, evaluator feedback, and code, rather than pretending one model call can do all of this. The trade-off is obvious, though. More access means more trust, more permissions, more questions about whether you want an external system reading your repo and proposing commits.
Justy: I could be wrong, but that probably splits the market. Some teams will use issue surfacing and dataset generation immediately, then hold off on repo access until they've seen a few good catches. That's a very normal landing path for this kind of tool.
Cody: I think that's right. Also, the named-issue abstraction is sneakily important. Reading one bad trace is emotionally convincing but statistically useless. Reading a cluster with a timeline, severity, recurrence, and links back to evidence is way closer to how teams actually prioritize work.
Justy: And it gives product people something legible. Not just, the model seemed weird, but, this failure mode started after that release and touched this slice of sessions. That's the difference between vague concern and an actual ticket.
Cody: Small side note, your kitchen somehow has no normal mugs. I'm drinking coffee out of what feels like a soup decision. Anyway, in terms of what to build next, the obvious move is: if someone already uses LangSmith, connect an existing tracing project and a non-critical repo in the beta and watch what issue clusters show up.
Justy: The mug is ambitious, not wrong.
Cody: For a solo builder, I'd do a tiny version of the pattern this weekend. Log traces, tag failures manually or with a simple evaluator, cluster similar bad runs by embedding or even keyword rules, and dump the worst cluster into a dataset. Then write one targeted eval that would have caught it. Even without Engine, that workflow teaches the habit.
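The weekend version Cody outlines could look something like the sketch below, using plain keyword rules rather than embeddings. Everything in it is hypothetical: the run dictionaries, the `RULES` table, the helper names, and the targeted eval are illustrations of the habit, not any library's API.

```python
import json
from collections import defaultdict

# Hypothetical keyword rules: cluster name -> trigger words in the user input.
RULES = {
    "cancellation": ("cancel", "unsubscribe"),
    "billing": ("refund", "invoice", "charge"),
}


def cluster_failures(runs, rules=RULES):
    """Group failed runs by the first rule whose keyword appears in the input."""
    clusters = defaultdict(list)
    for run in runs:
        if not run["failed"]:
            continue
        text = run["input"].lower()
        label = next(
            (name for name, words in rules.items() if any(w in text for w in words)),
            "other",
        )
        clusters[label].append(run)
    return dict(clusters)


def worst_cluster(clusters):
    """Name of the largest cluster (alphabetical on ties), or None if empty."""
    if not clusters:
        return None
    return max(sorted(clusters), key=lambda name: len(clusters[name]))


def to_dataset(runs, criteria):
    """Turn failing runs into dataset rows with a per-example criterion."""
    return [
        {"input": r["input"], "bad_output": r["output"], "criteria": criteria}
        for r in runs
    ]


def to_jsonl(rows):
    """Serialize rows as JSON Lines so they can be saved or uploaded later."""
    return "\n".join(json.dumps(row) for row in rows)


def eval_points_to_cancel_path(output):
    """Assumed targeted eval: the reply must mention cancelling and where to do it."""
    text = output.lower()
    return "cancel" in text and ("settings" in text or "link" in text)
```

Keyword rules are crude, but they are enough to find the biggest bucket of bad runs, and swapping in embedding-based clustering later only changes `cluster_failures`.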
Justy: And for teams already deeper in the stack, I'd compare three things after a week or two: did the surfaced issues match what humans would have found, were the drafted fixes directionally right, and did the new evaluators actually catch regressions later. That's the real scorecard, not the demo path.
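Justy's three-part scorecard can be tallied with a few lines once someone has reviewed the surfaced issues by hand. This is a minimal sketch under assumed names: `ReviewedIssue` and its fields are hypothetical labels a reviewer would fill in, not anything the product exports.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewedIssue:
    # Hypothetical review record, filled in by a human after a week or two.
    matched_human_triage: bool               # would a human have flagged this too?
    fix_directionally_right: Optional[bool]  # None if no fix was drafted
    regressions_seen: int                    # later recurrences of this failure
    regressions_caught: int                  # recurrences the new evaluator flagged


def _rate(numerator: int, denominator: int) -> Optional[float]:
    """Ratio, or None when there is nothing to measure."""
    return numerator / denominator if denominator else None


def scorecard(issues: list[ReviewedIssue]) -> dict:
    """Compute the three rates from the reviewed issues."""
    drafted = [i for i in issues if i.fix_directionally_right is not None]
    return {
        "issue_match_rate": _rate(
            sum(i.matched_human_triage for i in issues), len(issues)
        ),
        "fix_accuracy": _rate(
            sum(bool(i.fix_directionally_right) for i in drafted), len(drafted)
        ),
        "regression_catch_rate": _rate(
            sum(i.regressions_caught for i in issues),
            sum(i.regressions_seen for i in issues),
        ),
    }
```

Returning `None` instead of zero when a denominator is empty keeps "nothing to measure yet" distinct from "measured and bad".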
Cody: Yeah, that's the whole game. If it saves triage time and steadily hardens the eval suite, it's useful. If it mostly generates busywork with fancy labels, people will bounce fast.
Justy: I think that's a good place to leave it. Episode 395 and we're still learning that the fancy mug is less important than whether it keeps the coffee in, same as the agent loop, honestly.