Justy: So GitHub released this whole framework for testing agents, for the cases where the test itself can't tell a real failure from a loading screen that just took an extra second.
Cody: Right, and here's the thing: it's a genuinely thorny problem. You've got Copilot Coding Agent navigating VS Code, clicking around, waiting for things to appear. Traditional testing assumes every run follows the same path. But agents are non-deterministic by design. They adapt. So the test fails even though the agent actually succeeded, and your CI pipeline blocks a deploy for no reason.
Justy: That's the false negative thing they keep mentioning.
Cody: Exactly. And the problem gets worse the more realistic the environment. Timing drifts, UI elements render differently, loading screens appear or don't. You could record a trace on Tuesday, run it on Wednesday, and get a different result even though the agent nailed the task both times. The validation broke, not the agent.
Justy: Okay, but how do you actually fix that without just giving up and saying 'we'll accept any behavior as long as the outcome is right'?
Cody: That's where the graph stuff comes in. They're building a Prefix Tree Acceptor: nodes are observable states like screenshots, and edges are the actions between them. You collect maybe two to ten successful runs, merge them into one unified graph, and apply dominator analysis. That's compiler theory: node A dominates node B if every path from the start to B has to go through A. So you're automatically finding which states are mandatory, like 'search results appeared,' versus incidental, li
Justy: So you're not saying 'click here, then wait, then look for this button.' You're saying 'these states must happen, and whatever happens in between them can vary.'
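To make the merge-then-dominators idea concrete, here is a minimal Python sketch. It is not GitHub's implementation. It assumes each successful run has already been reduced to a list of state labels, treats identical labels as the same node, and uses a textbook iterative dominator computation. The names `merge_traces`, `dominators`, `essential_states`, and the example state labels are all hypothetical.

```python
START, END = "<start>", "<end>"  # synthetic entry and exit nodes (assumption)


def merge_traces(traces):
    """Merge successful runs into one graph: state -> set of successor states.

    Simplification: states with the same label collapse into one node.
    """
    graph = {START: set(), END: set()}
    for trace in traces:
        path = [START, *trace, END]
        for src, dst in zip(path, path[1:]):
            graph.setdefault(src, set()).add(dst)
            graph.setdefault(dst, set())
    return graph


def dominators(graph, start=START):
    """Iterative dataflow: dom[n] = {n} | intersection of dom[p] over predecessors p."""
    nodes = set(graph)
    preds = {n: set() for n in nodes}
    for src, dsts in graph.items():
        for dst in dsts:
            preds[dst].add(src)

    dom = {n: set(nodes) for n in nodes}  # start from "everything dominates"
    dom[start] = {start}
    changed = True
    while changed:
        changed = False
        for n in nodes - {start}:
            incoming = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def essential_states(traces):
    """States that every start-to-end path in the merged graph passes through."""
    dom = dominators(merge_traces(traces))
    milestones = dom[END] - {START, END}
    # Dominators of END form a chain, so sorting by depth recovers their order.
    return sorted(milestones, key=lambda s: len(dom[s]))


runs = [
    ["editor_open", "search_box", "results_shown", "file_opened"],
    ["editor_open", "loading", "search_box", "results_shown", "file_opened"],
    ["editor_open", "search_box", "results_shown", "tooltip", "file_opened"],
]
print(essential_states(runs))
# ['editor_open', 'search_box', 'results_shown', 'file_opened']
```

In this toy data, `loading` and `tooltip` drop out because at least one successful path skips them. One caveat of merging by label: the merged graph can contain paths that no single run took, so a state that appears in every run is not guaranteed to come out as a dominator.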
Cody: Right. And the clever part is the equivalence detection. Two screenshots might look almost identical, or they might have a timestamp change. They built a three-tier system: fast perceptual hashes first, then a multimodal LLM to ask, 'Are these the same logical state?' That feeds into the graph. The LLM isn't validating the whole trace—it's just answering a narrow question: 'Is this a meaningful difference or not?'
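A rough Python sketch of that escalation pattern follows, covering only the two tiers named in the conversation. It assumes screenshots have already been downscaled to small grayscale grids, uses a simple average hash as a stand-in for whatever perceptual hash the real system uses, and takes the model call as a caller-supplied function. The thresholds `near` and `far` and the `ask_model` callback are hypothetical.

```python
def average_hash(pixels):
    """Average hash of a small grayscale grid (2D list of 0-255 ints).

    Each bit is 1 where the pixel is brighter than the grid's mean.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)


def hamming(a, b):
    """Number of positions where two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def same_state(img_a, img_b, ask_model, near=2, far=20):
    """Decide whether two screenshots show the same logical state.

    Cheap hash comparison first; only ambiguous pairs are escalated to
    `ask_model`, a hypothetical callable wrapping a multimodal model.
    """
    distance = hamming(average_hash(img_a), average_hash(img_b))
    if distance <= near:
        return True   # near-identical: no model call needed
    if distance >= far:
        return False  # clearly different: no model call needed
    return ask_model(img_a, img_b)
```

The point of the design is cost control: the expensive model only sees the pairs the hash cannot settle, and it only answers the narrow same-state question.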
Justy: Okay, so I'm sold on the idea. The real question is adoption. Is this built into GitHub Actions natively, or is it a thing you have to wire up yourself?
Cody: That's where I get a little skeptical. This looks great in the blog post: UI automation in VS Code, containerized environments. But production is messier. You've got legacy APIs, weird edge cases, systems that don't render consistently. Does the PTA approach scale to that? And you need successful traces to start with. If you're testing a new agent workflow, you've got to manually seed it with those two to ten clean runs. That's a human cost.
Justy: But that's actually way better than writing assertions or maintaining a record-and-replay script. You're doing work once, upfront, and then it generalizes. And if you've got a mature agent workflow already running regularly, you've already got those traces. You're just mining what's already happening.
Cody: True. And the explainability story is huge. If your test fails, you get a clear explanation—'essential state X was never reached'—instead of 'the trace diverged.' That's a trust multiplier for anyone shipping agents to production.
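As an illustration of that kind of failure message, here is a small Python sketch of a validator that checks a new run against an ordered list of essential states and says which one is missing. The function name `validate`, the message wording, and the state labels are hypothetical, not taken from GitHub's tooling.

```python
def validate(trace, essential):
    """Check that the essential states appear in the trace, in order.

    Extra states between milestones are ignored. Returns (passed, explanation).
    """
    position = 0
    for milestone in essential:
        try:
            # Search only after the previous milestone to enforce ordering.
            position = trace.index(milestone, position) + 1
        except ValueError:
            if milestone in trace:
                return False, f"essential state '{milestone}' was reached out of order"
            return False, f"essential state '{milestone}' was never reached"
    return True, "all essential states reached"


essential = ["editor_open", "results_shown", "file_opened"]
print(validate(["editor_open", "loading", "results_shown", "file_opened"], essential))
# (True, 'all essential states reached')
print(validate(["editor_open", "loading", "loading"], essential))
# (False, "essential state 'results_shown' was never reached")
```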
Justy: Right. So the market angle is: if you're using agentic systems in production CI/CD, this solves a real pain. False negatives block deploys. Enterprise teams will pay attention because uptime and reliability matter.
Cody: Okay, so Build Next. If you're already running an agent in a GitHub Actions workflow, start collecting successful traces. Screenshot the UI state at each step, or capture the API responses. Feed those into the PTA algorithm. GitHub probably has a reference implementation by now, or you can sketch it out yourself if you're comfortable with graph merging and dominator analysis.
Justy: And the solo-builder version?
Cody: Weekend project: take a simple browser automation agent—Playwright or Puppeteer—run it through a few successful scenarios, capture the DOM state at each step, build a PTA by hand, identify the essential milestones, and write a validator that checks for those instead of exact sequence matching. You'll immediately see how much noise disappears and how much more robust your tests become.
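For the capture step of that weekend project, one possible starting point is a DOM fingerprint that ignores volatile details, so that two snapshots of the same logical page map to the same state label. The Python sketch below is an assumption-heavy illustration: the regex patterns, including the `id="row-123"` style of generated id, are hypothetical and would need tuning per application. With Playwright for Python, the HTML string could come from `page.content()`.

```python
import hashlib
import re

# Hypothetical noise patterns, applied in order: clock times, generated
# element ids such as id="row-123", then whitespace runs.
VOLATILE_PATTERNS = [
    (re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b"), "<time>"),
    (re.compile(r'\sid="[a-z]+-\d+"'), ""),
    (re.compile(r"\s+"), " "),
]


def dom_fingerprint(html):
    """Return a short, stable label for a DOM snapshot.

    Volatile details are normalized away before hashing, so snapshots that
    differ only in timestamps, generated ids, or whitespace get the same label.
    """
    text = html
    for pattern, replacement in VOLATILE_PATTERNS:
        text = pattern.sub(replacement, text)
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()[:12]


tuesday = '<div id="row-17">Results  (3)\n<span>12:04:59</span></div>'
wednesday = '<div id="row-92">Results (3) <span>9:15</span></div>'
print(dom_fingerprint(tuesday) == dom_fingerprint(wednesday))  # True
```

Exact hashing like this is deliberately strict: anything the patterns do not normalize produces a new state, which is where a fuzzier equivalence check would take over.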
Justy: That's the piece that matters. You feel the brittleness of the old way, and suddenly the graph approach clicks.