Izzo: What if your coding model could get better just by training on its own outputs?
Izzo: You're listening to Exploring Next, episode two-sixty-four. I'm Izzo, and with me is Boone. Today we're diving into Apple's Simple Self-Distillation paper—a technique that's embarrassingly simple but surprisingly powerful.
Boone: And when Apple says 'embarrassingly simple,' they mean it. No teacher models, no verification, no RL. Just sample from your model and fine-tune on those raw outputs.
Izzo: Which sounds almost too good to be true. But they took Qwen3-30B from 42% to 55% pass@1 on LiveCodeBench. That's a thirteen-point jump with no external supervision at all.
Boone: Right, and it works across five different models—Llama, Qwen, different sizes. The gains are real.
Izzo: So Boone, walk me through what they're actually doing here. Because sampling your own outputs and training on them feels like it should just reinforce whatever the model was already doing wrong.
Boone: That's the intuition, but they're being clever about temperature settings. When they generate the training data, they sample at one temperature, call it T-train. Then at inference time, they use a different temperature, T-eval.
Izzo: And that temperature shift is doing the heavy lifting?
Boone: Exactly. They sample training data at higher temperatures to get more diversity, then deploy at lower temperatures for more focused outputs. But here's the key insight—they identify what they call the precision-exploration conflict.
Izzo: Break that down for me.
Boone: Think about code generation. You've got fork positions where multiple approaches are valid—like choosing between a hash map or a tree structure. Then you've got lock positions where syntax is rigid—you can't just make up function names.
Izzo: Ah, so you want high temperature at the forks for creativity, but low temperature at the locks for correctness.
Boone: Right! But with traditional decoding, you pick one global temperature. It's always a compromise. SSD reshapes the model's distributions in a context-dependent way.
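The conflict Boone describes can be seen in a toy calculation. The sketch below is not from the paper: the logits are invented for illustration, and `summarize` is a hypothetical helper. It only shows that a single global temperature pulls a lock position and a fork position in opposite directions.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means more spread-out choices."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical logits, invented for illustration only.
# Lock: one correct token plus a tail of distractors.
lock_logits = [6.0, 2.0, 2.0, 2.0, 2.0, 2.0]
# Fork: three comparably good tokens plus a similar tail.
fork_logits = [4.0, 3.8, 3.6, 1.0, 1.0, 1.0]

def summarize(temperature):
    """Report how one global temperature affects both kinds of position."""
    lock = softmax_with_temperature(lock_logits, temperature)
    fork = softmax_with_temperature(fork_logits, temperature)
    return {
        "lock_distractor_mass": 1.0 - lock[0],  # chance of a wrong token at the lock
        "fork_entropy": entropy(fork),          # diversity available at the fork
    }

for t in (0.3, 1.0, 1.5):
    print(t, summarize(t))
```

Raising the temperature increases diversity at the fork but also increases the distractor mass at the lock; lowering it does the reverse. That is the compromise a single decoding temperature forces.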
Izzo: That's actually brilliant. Instead of fighting the temperature knob at inference time, you're baking that intelligence into the model weights.
Boone: And the mechanism is what they call support compression and within-support reshaping. At lock positions, it suppresses low-probability distractors. At fork positions, it preserves useful diversity.
Izzo: From a product perspective, this is huge. Every company building coding assistants is hitting the same wall—how do you get better without massive human labeling or complex RL setups?
Boone: The operational simplicity is the killer feature. You need prompts and compute. That's it. No execution environments, no test suites, no reward engineering.
Izzo: And they only needed one sample per prompt. Not even multiple candidates. The dataset is ten thousand competitive programming problems from rStar-Coder.
Boone: With minimal filtering. They just removed empty responses and single-line stubs. No correctness signal whatsoever.
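A minimal sketch of that kind of filter, assuming a "single-line stub" means a response with at most one non-blank line. The episode does not give the exact rule, so `keep_sample` and `filter_samples` are hypothetical names and the threshold is an assumption.

```python
def keep_sample(response: str) -> bool:
    """Keep a sampled response unless it is empty or a single-line stub.

    Assumption: a stub is any response with at most one non-blank line.
    No correctness check is performed, matching the setup described.
    """
    non_blank_lines = [line for line in response.splitlines() if line.strip()]
    return len(non_blank_lines) > 1

def filter_samples(pairs):
    """Filter (prompt, response) tuples, one sampled response per prompt."""
    return [(prompt, response) for prompt, response in pairs if keep_sample(response)]

samples = [
    ("two-sum", ""),                                    # empty: dropped
    ("two-sum", "pass"),                                # single-line stub: dropped
    ("two-sum", "def solve(a, b):\n    return a + b"),  # kept, right or wrong
]
print(len(filter_samples(samples)))
```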
Izzo: Wait, let me make sure I understand the training setup. They're using standard supervised fine-tuning—just cross-entropy loss over the sampled sequences?
Boone: Yep. Two thousand five hundred iterations for instruct models, three hundred for thinking models. Learning rate five times ten to the minus six. Nothing exotic.
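For reference, the loss being described is ordinary token-level cross-entropy on the model's own sampled tokens. The sketch below is a plain-Python illustration, not the paper's training code. The function name and config keys are hypothetical; the numbers in the config are the ones quoted in the episode.

```python
import math

def sequence_cross_entropy(token_logits, target_ids):
    """Mean negative log-likelihood of target_ids under per-position logits.

    token_logits: one list of vocabulary logits per position.
    target_ids:   the sampled token id at each position (the SFT target).
    """
    total = 0.0
    for logits, target in zip(token_logits, target_ids):
        peak = max(logits)  # log-sum-exp with max subtraction for stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_norm - logits[target]
    return total / len(target_ids)

# Hyperparameters as quoted in the episode; the key names are hypothetical.
SSD_CONFIG = {
    "learning_rate": 5e-6,
    "iterations_instruct": 2500,
    "iterations_thinking": 300,
}

# A uniform prediction over 4 tokens costs ln(4) nats per token.
print(sequence_cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2]))
```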
Izzo: This feels like one of those techniques that's going to propagate fast. The barrier to entry is so low.
Boone: And the gains concentrate on harder problems, which is where teams actually need the help. Easy problems already work fine.
Izzo: The coverage improvements are even more impressive. Hard problem pass@5 went from 31% to 54%. So it's not just getting one answer right—it's exploring the solution space better.
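For listeners less familiar with the metric: pass@k is the probability that at least one of k sampled solutions passes. The episode does not say which estimator the paper uses. The sketch below shows the commonly used unbiased estimator, which draws n samples per problem, counts the c that pass, and computes 1 - C(n-c, k) / C(n, k).

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct.

    Requires k <= n. If fewer than k samples are incorrect, every
    size-k subset contains at least one correct sample.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples and 1 correct, pass@1 is 0.1 and pass@5 is 0.5.
print(pass_at_k(10, 1, 1), pass_at_k(10, 1, 5))
```

The per-problem estimates are then averaged over the benchmark. A pass@5 well above pass@1 indicates that correct solutions exist in the model's samples even when the first attempt misses.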
Boone: That suggests the model is learning to maintain diversity where it matters while being precise where it counts. The theoretical analysis in their appendix backs this up.
Izzo: From a competitive standpoint, if I'm running a coding assistant product, this is probably shipping in the next quarter. The risk-reward is too good.
Boone: Especially since it generalizes across model families. They tested Llama and Qwen, both instruct and thinking variants. It's not some weird architectural quirk.
Izzo: The thinking models are interesting: those needed fewer training iterations but still saw gains. Makes sense if they're already better at reasoning through the problem space. And they validated on LiveCodeBench v6, which covers February to May 2025 problems. So it's not overfitting to some old benchmark. Boone, any concerns with this approach? It feels almost too clean.
Boone: The main risk is probably mode collapse if you iterate too many times. They only did one round of SSD, but i