Cody: Okay so I just read this paper and I need to know if I’m missing something.
Justy: Mm-hm.
Cody: They’re letting an LLM be its own data engineer. Like, the whole pipeline—planning, generating, testing, iterating—no humans touching it.
Justy: Right.
Cody: And they’re saying GPT-5.2 built a curriculum that improved a student model by fifty-seven percent.
Justy: Fifty-seven POINT twenty-nine.
Cody: Of course you remember the decimal.
Justy: I mean, that’s the headline. And it’s NOT a typo.
Cody: So the bottleneck here is obvious: we’ve been hand-rolling these data curation workflows forever, right? Domain adaptation always needs domain data, and getting it is slow, expensive, or both.
Justy: Exactly. And for most teams, the moment you move off general tasks—finance, legal, internal docs—you’re basically stuck unless you’ve got a dataset someone already built.
Cody: Which is never tailored enough. So they’re flipping it: what if the model just writes its own training data, tests it, and keeps rewriting until it works?
Justy: And the kicker is they’re treating data like code. Optimize, measure, iterate.
Cody: Yeah. So the agent starts by defining the domain—say, medical records—then designs prompts, synthesizes a dataset, trains a student model on it, evaluates the student on a test set…
Justy: Mm-hm.
Cody: …and if the student sucks at, I dunno, extracting dosage info, the agent goes back and generates more data targeting that specific gap. Rinse, repeat.
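[Show note: a minimal sketch of the loop Cody just described, so the shape of it is easier to see. Nothing here comes from the paper or the DataAgent repo. The skill names, the `DIFFICULTY` table, the toy `Student` class, and the `synthesize`, `evaluate`, and `run_loop` helpers are all hypothetical stand-ins: the "student" is a counter with a saturating accuracy curve, and "synthesis" just emits skill labels where a real system would call an LLM.]

```python
from dataclasses import dataclass, field

# Hypothetical skills for a toy "medical records" domain (illustrative only).
SKILLS = ["diagnosis", "dosage", "dates"]

# Hypothetical difficulty: higher means the toy student needs more examples.
DIFFICULTY = {"diagnosis": 2, "dosage": 20, "dates": 5}


@dataclass
class Student:
    """Toy stand-in for a student model: counts examples seen per skill."""
    seen: dict = field(default_factory=lambda: {s: 0 for s in SKILLS})

    def train(self, dataset):
        # "Training" here is just counting examples per skill.
        for skill in dataset:
            self.seen[skill] += 1

    def accuracy(self, skill):
        # Saturating curve: more examples give higher accuracy, always below 1.
        n = self.seen[skill]
        return n / (n + DIFFICULTY[skill])


def synthesize(skill, n):
    # Stand-in for LLM-generated training examples targeting one skill.
    return [skill] * n


def evaluate(student):
    # Stand-in for scoring the student on a per-skill test set.
    return {s: student.accuracy(s) for s in SKILLS}


def run_loop(target=0.7, batch=10, max_rounds=10):
    student = Student()
    history = []
    # Round 0 uses a uniform dataset across all skills.
    dataset = [ex for s in SKILLS for ex in synthesize(s, batch)]
    for _ in range(max_rounds):
        student.train(dataset)
        scores = evaluate(student)
        history.append(scores)
        # Find the gaps: skills still scoring below the target.
        weak = [s for s, acc in scores.items() if acc < target]
        if not weak:
            break
        # Next round generates data only for the weak skills.
        dataset = [ex for s in weak for ex in synthesize(s, batch)]
    return student, history


if __name__ == "__main__":
    student, history = run_loop()
    print(len(history), student.seen)
```

[In this toy setup the hard skill, dosage, ends up with far more generated examples than the easy ones, which is the "target the specific gap" behavior from the conversation. The `max_rounds` cap is there because a loop like this has no guarantee of ever reaching the target.]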
Justy: Okay, so this is the part where I’m supposed to say this is Exploring Next gold, right? Agent-driven specialization.
Cody: You would.
Justy: But come on—imagine shipping this. You’re a startup with zero labeled data in your niche. You fire up DataAgent, point it at your docs, and a week later you’ve got a model that actually understands your stuff.
Cody: A week? Justy, you’re assuming the compute budget of a small country.
Justy: Fine, a month. With cloud costs that make your CFO cry.
Cody: And that’s before we talk about the feedback loop problems. If your eval set’s even slightly skewed, you’re just teaching the model to game the metric.
Justy: Right, right. But the paper’s not claiming it’s production-ready. It’s saying the capability exists. Autonomous data engineering as a measurable thing.
Cody: Fair. And the code’s on GitHub—DataAgent. So if someone wanted to poke at it, they could.
Justy: So you’re saying it’s research-only for now.
Cody: I’m saying if I were building this, I’d start with a tiny domain and a very tight eval harness. And I’d still expect to debug for weeks.
Justy: Meanwhile, my brain’s already writing the product spec. ‘Just add your docs, we’ll handle the rest.’
Cody: That’s such a you move.
Justy: Anyway. Flight’s delayed, so I’m just sitting here in the airport lounge, reading papers like a weirdo.
Cody: Of course you are. What’s your ETA?
Justy: Another two hours. Anyway—this thing feels like it could un-block a lot of teams.
Cody: Maybe. But I’d bet good money the first three companies that try it hit a wall on eval data quality.
Justy: Or they overfit to their own benchmarks and ship a model that’s amazing at the test set and useless in the wild.
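[Show note: one simple guard against the failure mode Justy describes is to keep a held-out set that the data-generation loop never sees, and compare it against the eval set the loop optimizes for. This is a sketch under assumptions, not something the paper prescribes. The function names and the `max_gap` threshold of 0.1 are hypothetical, and the scores are per-example 0/1 correctness values.]

```python
def mean(values):
    return sum(values) / len(values)


def generalization_gap(optimized_scores, held_out_scores):
    """Mean score on the optimized eval set minus mean score on held-out data."""
    return mean(optimized_scores) - mean(held_out_scores)


def looks_overfit(optimized_scores, held_out_scores, max_gap=0.1):
    # max_gap is an arbitrary illustrative threshold, not a recommended value.
    return generalization_gap(optimized_scores, held_out_scores) > max_gap


if __name__ == "__main__":
    optimized = [1, 1, 1, 0]  # per-example correctness on the loop's eval set
    held_out = [1, 0, 0, 0]   # per-example correctness on data the loop never saw
    print(generalization_gap(optimized, held_out))
    print(looks_overfit(optimized, held_out))
```

[A large positive gap suggests the student has learned the benchmark more than the domain. A small gap does not prove the opposite, because the held-out set can share the same skew as the eval set.]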
Cody: Bingo. And that’s episode four-forty, I guess.
Justy: God, we’re up to four-forty already? Anyway—DataAgent’s on GitHub if you’re brave. I’m gonna go find a coffee that doesn’t taste like jet fuel.