Justy: The weird part of agent building right now is the smartest systems can still feel impossible to inspect.
Justy: This is Exploring Next, episode 313. I'm in Cody's kitchen in DC, still calibrating after the flight, and we're talking about AgentSPEX.
Cody: What grabbed me is the target. They are going after that awkward middle ground where plain ReAct prompts are too loose, but Python orchestration frameworks get tangled fast. If you've built with LangGraph or DSPy or CrewAI, you've probably felt that.
Justy: Yeah, and that matters now because more teams want agents to do long jobs, not just one-shot chat. Research flows, software tasks, proposal writing, anything with loops and branching. Product people want to change behavior without reopening a pile of orchestration code.
Cody: AgentSPEX puts the workflow in YAML as an executable spec, using primitives like task, step, if, while, call, parallel, gather, and state operations instead of scattering logic across Python and prompts.
Justy: And the plain-language version is, your agent gets an actual flowchart. Not just vibes and a giant system prompt. [chuckles] That is a lot easier to reason about when something goes sideways.
Cody: They also split specification from the runtime harness, which adds tools, sandboxing, checkpointing, logging, replay, and resume. That makes the workflow more portable in principle.
Justy: The explicit state piece matters too. Workflows keep named variables, steps can save outputs, and templates pull in only what each step needs, which helps control context creep.
Cody: They even make conversation history explicit: a task can start fresh, while a step can continue a persistent thread. That gives you more control over memory, performance, and reproducibility.
Justy: The example that made it click for me was a research assistant that generates search queries, calls a search-and-summarize submodule, then writes a report. Not flashy, but very shippable.
Cody: And submodules are just workflows calling other workflows, with parallel and gather for concurrency. So you can compose a deeper agent from smaller pieces instead of one monster prompt.
Justy: This is where I think the audience splits. If you're a solo builder, this could be a nice way to keep a weekend project understandable. If you're a company, I could see it fitting research ops, internal assistants, paper triage, maybe software workflows where auditability matters. The question is whether it's research-only.
Cody: My read is no. The runtime features make it feel closer to deployable infrastructure than a demo, though YAML can still become a maze if the workflow grows without discipline.
Justy: I do like that they paired benchmarks with a user study on interpretability and accessibility. Authoring experience is part of the product, not just task scores.
Cody: Methodology-wise, I buy the broader claim more than any single number: explicit control flow should help on long-horizon tasks. I'd still want stronger ablations around context management and debugging nested state.
Justy: So if someone wants to build next, the obvious starting point is the GitHub repo, ScaleML slash AgentSPEX, and try one of the ready-made deep research or scientific research agents.
Cody: Then do a side-by-side. Recreate the same workflow in AgentSPEX and in a Python graph framework. Measure edit time, number of files touched, and whether you can replay a failed run. That's a very honest comparison.
Justy: For a solo builder, I'd make a tiny literature scout. One workflow to generate search queries, one submodule to summarize sources, one final writer step to produce a markdown brief. Keep context tight on purpose so each step only sees what it needs.
Cody: And if you want to stress the language, add parallel searches plus a while loop with an iteration cap for follow-up queries. That gets you branching, concurrency, and state without building a huge system. [sighs] Which is probably enough complexity for one weekend.
Justy: That's AgentSPEX. A more inspectable way to build agents, provided your flowchart doesn't turn into a spaghetti diagram on Cody's countertop.