Justy: So we're in a world where AI agents are supposed to be coworkers, right? Like, they stick around for days, they help you with email and calendars and files. But nobody's actually measuring whether they can handle that.
Cody: Exactly. Existing benchmarks mostly run the agent inside a single session where the environment is frozen. You give it a task, it finishes, and the environment resets. But real office work isn't like that. An email arrives while you're thinking. Someone updates a spreadsheet. The calendar shifts. And the agent has to notice, adapt, and keep going.
Justy: Right. And the paper's called ClawMark—it's a benchmark specifically for multi-day, multi-turn agent workflows. That's what caught my attention. So what's actually different here compared to, say, OSWorld or tau-bench?
Cody: ClawMark combines three things that no other benchmark puts together. First, tasks span multiple in-universe workdays, and each day is a separate turn. Second, the environment changes between turns independently of the agent. Not because the agent made changes, but because the world did. The paper calls them 'loud events' for announced changes and 'silent mutations' for unannounced ones. Third, evidence is fully multimodal: PDFs, scanned documents, audio, video, spreadsheets, not just text.
Justy: Okay, so the infrastructure—they built actual stateful services, right? Not just mocked logs.
Cody: Yes. Five of them, all running in Docker. Real filesystem, GreenMail for email, Notion-compatible knowledge base, Google Sheets–compatible spreadsheet, Radicale for calendars. They're actual stateful services, not snapshots. Between turns, they mutate. The agent sees the real current state when it wakes up on Day 2, not a cached version from Day 1.
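A minimal sketch of the turn structure Cody describes, assuming a toy in-memory environment instead of the real Docker services. Every name here (`Environment`, `loud_event`, `silent_mutation`, `run_task`, `snapshot_agent`) is hypothetical and is not taken from the ClawMark repo; the point is only that exogenous changes land between turns and the agent reads live state each day.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    """Toy stand-in for the stateful services (hypothetical, in-memory only)."""
    inbox: list = field(default_factory=list)
    sheet: dict = field(default_factory=dict)


def loud_event(env):
    # Announced change: a new email arrives, so the agent has something to notice.
    env.inbox.append("Budget meeting moved to Friday")


def silent_mutation(env):
    # Unannounced change: a spreadsheet cell is edited with no notification.
    env.sheet["Q3_budget"] = 12000


def run_task(agent, env, mutations_by_day, days):
    """Run one turn per in-universe day, mutating the world between turns."""
    observations = []
    for day in range(1, days + 1):
        # Exogenous updates scheduled for this day land before the agent wakes.
        for mutate in mutations_by_day.get(day, []):
            mutate(env)
        # The agent always reads live state, never a snapshot from an earlier day.
        observations.append(agent(day, env))
    return observations


def snapshot_agent(day, env):
    # Hypothetical agent that simply reports what it can see right now.
    return {"day": day, "inbox": list(env.inbox), "sheet": dict(env.sheet)}


env = Environment(sheet={"Q3_budget": 10000})
schedule = {2: [loud_event, silent_mutation]}
history = run_task(snapshot_agent, env, schedule, days=3)
```

In this toy run the Day 1 observation still shows the original budget figure, while Day 2 shows both the new email and the silently edited cell. An agent that cached its Day 1 view would miss the second change.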
Justy: And the scoring is deterministic. No LLM-as-judge.
Cody: Exactly. 1,537 Python checkers inspect the post-execution state of each service. A task is admitted to the benchmark only after two independent runs produce bit-identical checker verdicts. That takes judge subjectivity out of the scoring.
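A rough sketch of what a state-inspecting checker and the two-run admission rule could look like. This is an illustration under assumptions, not the benchmark's actual code: the checker name, the shape of the `state` dictionary, and the idea that each run is represented by its final service state are all hypothetical.

```python
def check_meeting_rescheduled(state):
    """Hypothetical checker: inspects post-execution state, returns pass or fail."""
    events = state.get("calendar", [])
    return any(e["title"] == "Budget review" and e["day"] == "Friday" for e in events)


def check_reply_sent(state):
    """Hypothetical checker: at least one sent email mentions the new day."""
    return any("Friday" in body for body in state.get("sent_mail", []))


def verdicts(checkers, state):
    # Deterministic: the same state always yields the same tuple of booleans.
    return tuple(check(state) for check in checkers)


def admit_task(checkers, run_states):
    """Admit a task only if every independent run yields identical verdicts."""
    results = [verdicts(checkers, state) for state in run_states]
    return all(result == results[0] for result in results)


checkers = [check_meeting_rescheduled, check_reply_sent]
run_a = {
    "calendar": [{"title": "Budget review", "day": "Friday"}],
    "sent_mail": ["Confirmed, see you Friday."],
}
run_b = {
    "calendar": [{"title": "Budget review", "day": "Friday"}],
    "sent_mail": ["Friday works for me."],
}
run_c = {"calendar": [{"title": "Budget review", "day": "Thursday"}], "sent_mail": []}

stable = admit_task(checkers, [run_a, run_b])
unstable = admit_task(checkers, [run_a, run_c])
```

Here `run_a` and `run_b` differ in wording but produce the same verdicts, so the task would be admitted. `run_a` and `run_c` disagree, so it would not.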
Justy: So what do the results actually show? Because a 75.8 weighted score sounds good on paper.
Cody: The gap between weighted score and strict task success is the story. Claude Sonnet 4.6 leads at 75.8 weighted, but the best strict task success is only 20%. That means agents are making progress—partial credit—but almost never completing the full workflow end-to-end. On tasks with exactly three turns, six out of seven models degrade on Day 2, right after the first exogenous environment update hits.
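To make the gap between the two numbers concrete, here is a small worked sketch. It assumes the weighted score is the share of checker weight that passed on a 0 to 100 scale, and that strict success means every checker on a task passed. The conversation does not spell out the exact formulas, so both definitions, the weights, and the task names are illustrative assumptions.

```python
def weighted_score(checks):
    """Percent of checker weight that passed, on a 0-100 scale (assumed formula)."""
    total = sum(weight for weight, _ in checks)
    earned = sum(weight for weight, passed in checks if passed)
    return 100.0 * earned / total


def strict_success(checks):
    # Assumed definition: strict success means every checker on the task passed.
    return all(passed for _, passed in checks)


def summarize(tasks):
    """Return (mean weighted score, strict success rate in percent)."""
    mean_weighted = sum(weighted_score(c) for c in tasks.values()) / len(tasks)
    strict_rate = 100.0 * sum(strict_success(c) for c in tasks.values()) / len(tasks)
    return mean_weighted, strict_rate


# Each task is a list of (weight, passed) pairs; all values are made up.
tasks = {
    "reschedule_meeting": [(3, True), (2, True), (1, False)],
    "file_expense_report": [(1, True), (1, True)],
    "update_vendor_sheet": [(2, False), (2, True)],
}
mean_weighted, strict_rate = summarize(tasks)
```

In this made-up example only one of the three tasks passes every checker, yet the mean weighted score stays well above the strict rate. That is the same pattern as a high weighted score sitting next to a low strict success number.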
Justy: [pause] So the agent sees the world change and just... loses track.
Cody: Or doesn't notice it changed at all. The paper calls it out explicitly: 'adaptation to changing state as a key open challenge.' This isn't about reasoning or tool use. It's about maintaining context when the ground truth shifts underneath you between turns.
Justy: That's a real problem for a coworker agent. So who's actually going to use this benchmark?
Cody: It's open-source—github.com/evolvent-ai/ClawMark—so anyone can run it. The repo includes the evaluation harness, the five service implementations, all 1,537 checkers, and the construction pipeline. For teams building coworker agents, this becomes part of your validation pipeline before you ship. The deterministic scoring means you can run it in CI/CD.
Justy: What about solo builders? Someone working on an agent framework who wants to test their own system?
Cody: The construction pipeline is the key. You can use it to generate new scenarios without running the full 100-task suite. Start with a couple of services, say email and calendar, and build a three-day task where external events happen between days. Run your agent, check the deterministic verdict. You get real feedback on adaptation without needing the whole benchmark infrastructure.
Justy: And the core insight here is that coworker agents aren't about single-turn reasoning. They're about persistence and adaptation when the world changes. ClawMark measures that in a way nothing else does right now.
Cody: Exactly. The 20% strict task success rate is actually the honest number. It's not a failure of the benchmark; it's a failure of current agent systems to handle real coworker workflows. ClawMark just made it visible.
Justy: This is Exploring Next, episode 342. We'll link the benchmark and the paper in the show notes. If you're building agent systems, this is worth your time.