Izzo: Researchers just figured out how to make AI do science while they sleep.
Izzo: You're listening to Exploring Next, episode two-thirty-nine. I'm Izzo, and I'm here with Boone to talk about something that might fundamentally change how research gets done.
Boone: Andrej Karpathy dropped a 630-line script called autoresearch, and it's basically the scientific method running in a loop.
Izzo: Right, but here's why this matters right now — we're all drowning in hyperparameter tuning. Every ML team I know spends weeks tweaking learning rates and architecture depths manually.
Boone: And Karpathy just automated the entire workflow. The agent reads its own source code, forms a hypothesis like 'what if I bump the learning rate,' modifies the code, runs the experiment, checks if validation loss improved.
Izzo: If it's better, keep it. If not, revert and try something else. Boone, break down how this optimization loop actually works.
Boone: So you give the agent a training script and a fixed compute budget — typically five minutes on a GPU. It's constrained, which is brilliant because it forces fast iterations.
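A minimal sketch of the keep-or-revert loop Boone describes, under loud assumptions: the real agent edits training source code and runs an actual GPU job, whereas here `toy_validation_loss` is a made-up stand-in for a fixed-budget run and `propose_change` is a random perturbation rather than a model-generated hypothesis. All function names and the starting config are hypothetical, not taken from the autoresearch script.

```python
import math
import random

BUDGET_SECONDS = 300  # the five-minute budget mentioned in the episode


def toy_validation_loss(config, budget_seconds=BUDGET_SECONDS):
    """Hypothetical stand-in for one fixed-budget training run.

    A real run would train for `budget_seconds` and report validation loss;
    this toy ignores the budget and scores the config with a smooth bowl
    whose minimum sits at lr=0.03, depth=8.
    """
    lr_term = (math.log10(config["lr"]) - math.log10(0.03)) ** 2
    depth_term = ((config["depth"] - 8) / 8) ** 2
    return 0.95 + 0.05 * (lr_term + depth_term)


def propose_change(config, rng):
    """Hypothetical 'hypothesis' step: perturb one hyperparameter at random."""
    candidate = dict(config)
    if rng.random() < 0.5:
        candidate["lr"] = config["lr"] * rng.choice([0.5, 2.0])
    else:
        candidate["depth"] = max(1, config["depth"] + rng.choice([-1, 1]))
    return candidate


def keep_or_revert_loop(config, n_experiments, seed=0):
    rng = random.Random(seed)
    best_loss = toy_validation_loss(config)
    history = []
    for _ in range(n_experiments):
        candidate = propose_change(config, rng)
        loss = toy_validation_loss(candidate)
        kept = loss < best_loss
        if kept:
            config, best_loss = candidate, loss  # keep the change
        # otherwise the candidate is dropped, i.e. the change is reverted
        history.append((candidate, loss, kept))
    return config, best_loss, history


start = {"lr": 0.001, "depth": 4}
final_config, final_loss, history = keep_or_revert_loop(start, n_experiments=126)
print(final_config, round(final_loss, 4), sum(kept for _, _, kept in history))
```

Because a change is only kept when it strictly lowers the measured loss, the best loss can never get worse over the run, which is what makes a short, fixed budget per experiment workable.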
Izzo: That's the key insight I'm seeing. It's not about having infinite compute, it's about velocity.
Boone: Exactly. In one overnight run, Karpathy's agent completed 126 experiments, driving loss from 0.9979 down to 0.9697. That's real progress happening at silicon speed.
Izzo: And then he left it running for two days and it made 700 autonomous changes. Found twenty improvements that transferred perfectly to larger models.
Boone: The transfer part is huge, Izzo. Usually when you optimize for one model size, the gains don't carry over. But these agents found genuinely general improvements.
Izzo: The business impact hit me when I saw that the 'Time to GPT-2' metric dropped from 2.02 hours to 1.80 hours. That's roughly an eleven percent efficiency gain on something Karpathy thought was already optimized.
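The percentages behind these figures can be checked with a few lines of arithmetic. The helper name `relative_reduction` is just an illustration; the inputs are the numbers quoted in the episode.

```python
def relative_reduction(before, after):
    """Fraction by which a metric shrank relative to its starting value."""
    return (before - after) / before


# 'Time to GPT-2' in hours, as quoted above
print(f"{relative_reduction(2.02, 1.80):.1%}")      # 10.9%

# Validation loss from the overnight run of 126 experiments
print(f"{relative_reduction(0.9979, 0.9697):.1%}")  # 2.8%
```

So the "eleven percent" is a rounding of about 10.9 percent, and the overnight loss improvement works out to a little under 3 percent.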
Boone: Right, and he mentioned the agent caught oversights in attention scaling that he'd missed while tuning by hand, despite twenty years of doing this work. That's humbling.
Izzo: But here's where it gets wild — people immediately started distributing this across networks. Tell me about what happened with Hyperspace.
Boone: Varun Mathur took the single-agent loop and spread it across a peer-to-peer network. Thirty-five autonomous agents ran 333 experiments in one night, completely unsupervised.
Izzo: And the emergent behavior was fascinating. Hardware diversity became a feature, not a bug.
Boone: Yeah, the H100 GPUs were doing brute force approaches with aggressive learning rates, but the CPU-only agents on laptops had to get clever. They focused on initialization strategies — Kaiming, Xavier init — because they couldn't rely on raw throughput.
Izzo: That's actually brilliant product strategy. The constraints forced innovation.
Boone: And they used the GossipSub protocol for real-time sharing. When one agent discovered Kaiming initialization dropped loss by twenty-one percent, it spread through the network like a virus.
Izzo: Within hours, twenty-three other agents incorporated that discovery. It's like watching evolution in fast-forward.
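To give a feel for how one discovery can spread peer to peer, here is a toy push-gossip simulation. It is not GossipSub and uses no libp2p API: it is a generic in-memory model in which each agent pushes its best known loss to a few random peers per round. The fanout, the baseline loss of 1.0, and the 0.79 value (a 21 percent drop from that baseline) are illustrative assumptions.

```python
import random


def gossip_spread(n_agents=35, fanout=3, seed=0, max_rounds=1000):
    """Return how many agents hold the best result after each round."""
    rng = random.Random(seed)
    baseline, discovery = 1.0, 0.79  # hypothetical losses
    best = [baseline] * n_agents
    best[0] = discovery              # agent 0 makes the discovery
    informed_per_round = [1]

    for _ in range(max_rounds):
        snapshot = list(best)        # what each agent knew at round start
        for sender in range(n_agents):
            others = [p for p in range(n_agents) if p != sender]
            for peer in rng.sample(others, fanout):
                # a peer adopts a result only if it beats its own
                if snapshot[sender] < best[peer]:
                    best[peer] = snapshot[sender]
        informed = sum(1 for loss in best if loss == discovery)
        informed_per_round.append(informed)
        if informed == n_agents:
            break
    return informed_per_round


counts = gossip_spread()
print(counts)
```

The count of informed agents can only grow from round to round, and in this model the growth is roughly exponential early on, which is one way to read the "like a virus" description.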
Boone: The timeline compression is mind-bending. In seventeen hours, these agents independently rediscovered ML milestones that took human researchers at Google Brain and OpenAI eight years to formalize.
Izzo: RMSNorm, tied embeddings — all the classics, just happening overnight. But Boone, this isn't staying in ML land.
Boone: Eric Siu applied it to marketing experiments. Instead of the typical thirty experiments a year, he's talking about thirty-six thousand.
Izzo: The framework is identical — replace the training script with a marketing asset, modify variables like subject lines, measure positive reply rate, keep or discard.
Boone: It creates what he calls a 'proprietary map' of what resonates with your audience. The companies that win won't have better marketers, they'll have faster experiment loops.
Izzo: I'm giving this concept an A-minus. The technical execution is elegant, the applications are immediate, and the network effects are already happening.
Boone: The community raised valid concerns though. Someone asked about 'spoiling' the validation set — if you run enough experiments, are you just optimizing for quirks in your test data?
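The 'spoiling' concern can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of anyone's actual setup: every candidate change here has zero real effect, each evaluation adds Gaussian noise with a made-up standard deviation of 0.01, and a change is kept whenever its noisy validation loss beats the best seen so far.

```python
import random


def selection_bias_demo(n_experiments=700, noise=0.01, true_loss=1.0, seed=0):
    """Keep-if-better on a noisy validation metric when nothing really improves."""
    rng = random.Random(seed)

    def measure():
        # one noisy evaluation of a model whose true loss never changes
        return true_loss + rng.gauss(0.0, noise)

    best_val = measure()  # baseline measurement
    for _ in range(n_experiments):
        candidate_val = measure()
        if candidate_val < best_val:
            best_val = candidate_val  # "improvement" that is pure noise

    # fresh data: average many independent evaluations of the final model
    holdout_loss = sum(measure() for _ in range(1000)) / 1000
    return best_val, holdout_loss


best_val, holdout_loss = selection_bias_demo()
print(f"best validation loss seen: {best_val:.4f}")
print(f"loss on fresh evaluations: {holdout_loss:.4f}")
```

The best validation number drifts below the true loss purely through selection, while fresh evaluations stay near it. Checking kept changes on data the loop never saw, or on a larger model as in the transfer result mentioned earlier, is one way to separate real gains from this effect.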
Izzo: That's the classic overfitting problem scaled up. But Karpathy's response was solid — all we're doing is optimizing performance per compute, and these are real gains. Plus there's something beautiful about one user's insight — their Mac Mini M4 ran thirty-five experiments overnight, twenty-six failed, but the seven that worked revealed that the model got better by getting simpler. That discovery happened without human intervention. The agent found that less is more. I'm defin