Izzo: Every coding agent demo shows the same thing — write a function, pass the tests, ship it. But what happens when you need to extend that code six months later?
Izzo: You're listening to Exploring Next, episode two-fifty-one. I'm Izzo, and with me is Boone. Today we're diving into SlopCodeBench — research that just shattered every assumption about how good AI coding agents actually are.
Boone: And Izzo, this isn't about whether agents can solve coding puzzles. It's about whether they can build software that doesn't turn into an unmaintainable nightmare.
Izzo: Right. Because here's what I see in product land — everyone's excited about agents writing code, but nobody's asking what happens when you need to iterate on that code. Which is, you know, the entire software development process.
Boone: Exactly. So Gabriel Orlanski's team at UW-Madison built the first benchmark that actually measures this. They call it SlopCodeBench, and it's brutal.
Izzo: How brutal are we talking?
Boone: Zero agents — across eleven different models — solved any complete problem end-to-end. The best checkpoint solve rate was seventeen-point-two percent.
Izzo: Wait, zero? Not even GPT-4 or Claude?
Boone: Nope. And it gets worse. They tracked code quality across iterations using two metrics: verbosity, which measures redundant and duplicated code, and structural erosion — basically how much complexity gets jammed into individual functions.
Izzo: Boone, break down how this benchmark actually works. Because most coding tests are just 'write a function that sorts an array,' right?
Boone: Right, and that's exactly the problem. SlopCodeBench gives agents twenty problems with ninety-three checkpoints total. But here's the key — it forces architectural decisions without prescribing internal structure.
Izzo: Meaning what in practice?
Boone: So instead of 'implement quicksort,' it's more like 'build a data processing system, now add filtering, now add aggregation, now add real-time updates.' Each extension builds on your previous code.
Izzo: Ah, so the agent has to live with its own architectural choices. That's... actually genius. Because in real product development, you can't just rewrite everything when requirements change.
Boone: Exactly. And the results are devastating. Quality degrades in eighty percent of trajectories for structural erosion and almost ninety percent for verbosity.
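One way to read "degrades in X percent of trajectories" is sketched below in Python. It assumes a trajectory is the list of one quality score per checkpoint, that a higher score is worse, and that "degraded" means the last checkpoint scores worse than the first. The paper may define degradation differently, and `degradation_rate` is a hypothetical helper, not part of SlopCodeBench.

```python
def degrades(scores: list[float]) -> bool:
    """True when the final checkpoint scores worse (higher) than the first."""
    return len(scores) >= 2 and scores[-1] > scores[0]


def degradation_rate(trajectories: dict[str, list[float]]) -> float:
    """Fraction of trajectories whose quality score got worse from start to end."""
    if not trajectories:
        return 0.0
    degraded = sum(degrades(scores) for scores in trajectories.values())
    return degraded / len(trajectories)
```

The same function works for either metric: feed it per-checkpoint verbosity scores or per-checkpoint erosion scores, keyed by problem or run name.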
Izzo: What does that degradation actually look like? Give me specifics.
Boone: So they compared agent code against forty-eight open-source Python repositories. Agent code was two-point-two times more verbose — so more than double the redundant code.
Izzo: Ouch.
Boone: And when they tracked twenty of those real repositories over time, human code quality stayed flat while agent code deteriorated with each iteration. It's like technical debt on steroids.
Izzo: This explains so much of what I'm seeing in the field. Teams get excited about AI-generated code, ship the first version, then three months later they're rewriting everything because it's unmaintainable.
Boone: The paper calls it 'extension robustness' — how well code survives being extended. And current benchmarks completely miss this.
Izzo: So who actually uses this research? Because it feels like the entire 'AI will replace developers' narrative just took a massive hit.
Boone: I think it's the opposite — it shows us exactly what to work on. Instead of focusing on single-shot code generation, we need agents that understand software architecture and long-term maintainability.
Izzo: Right. And from a product perspective, this creates a huge opportunity. The company that solves iterative code quality is going to own the enterprise AI coding market.
Boone: They did try one intervention study — prompting agents to focus on initial code quality. It helped the first iteration but didn't stop the degradation.
Izzo: So it's not just about writing better prompts. The fundamental approach needs to change.
Boone: Exactly. We need agents that think about code evolution, not just immediate functionality. Maybe something that maintains architectural invariants across iterations.
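To make the idea of an architectural invariant concrete, here is one small Python sketch: a check that could run after every iteration and flag functions that have grown past a size budget. It is only an illustration of the general idea Boone raises, not something the paper proposes. The function name `oversized_functions` and the default budget of 25 statements are assumptions chosen for the example.

```python
import ast


def oversized_functions(source: str, max_statements: int = 25) -> list[str]:
    """Return names of functions that contain more statements than the budget."""
    offenders = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # ast.walk yields the function node itself, which is a statement,
            # so subtract one to count only the statements inside it.
            statements = sum(isinstance(child, ast.stmt) for child in ast.walk(node)) - 1
            if statements > max_statements:
                offenders.append(node.name)
    return offenders
```

An agent loop could call this after each extension and refuse to move on, or trigger a refactoring step, whenever the returned list is non-empty.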
Izzo: I'm giving this research an A-minus. It's the first benchmark that measures what actually matters in software development, and the methodology looks solid.
Boone: Agreed. And it's language-agnostic, so you can test this across Python, JavaScript, whatever you're working in.
Izzo: So what should people actually go build with this?
Boone: First, clone the SlopCodeBench repo and run your favorite coding agent through it. See how badly it fails at maintaining code quality over iterations. Second, if