Justy: If your bigger model is tutoring your smaller model into getting worse, you do not have a scaling strategy. You have an expensive self-own.
Justy: Welcome back to Exploring Next, episode 300. I’m Justy, Cody is here, and today we’re getting into a paper with a very practical question hiding under a very academic title: how do you fine-tune a reasoning model without accidentally sanding off the parts that already work?
Cody: Yeah, and I like this one because it starts from a failure case people have definitely hit in the wild. You take a strong reasoning model, synthesize supervised data, train the smaller model on it, and then... benchmark goes down. That is such a rude result, but it feels real.
Justy: That’s the hook for me. Everybody wants the cheaper model with most of the brains. Product teams want lower latency, lower inference cost, easier deployment. If the standard playbook for distillation or SFT breaks on reasoning models, that blocks an actual market, not just a leaderboard hobby.
Cody: Right, and the paper is pretty specific about where it breaks. They use GPT-OSS-120B as the teacher and Qwen3-8B as the student on code generation. Directly fine-tuning Qwen3-8B on teacher-generated data drops performance by 3.25 percent on LiveCodeBench-Pro and 10.02 percent on OJBench. That is not noise. That is the model learning the wrong thing.
Justy: Also, ten percent on OJBench is the kind of drop that gets somebody in a planning meeting saying, no no, we tried synthetic data, didn’t work, moving on. I’m giving that outcome an F. A deeply committed F.
Cody: An F for the baseline? Justy, some of us enjoy watching bad baselines suffer. It builds character.
Justy: Okay, that’s our warning sign. So Cody, break down the paper’s claim in plain English. What do they think is actually going wrong?
Cody: They argue the issue is stylistic divergence between teacher and student. Reasoning models don’t just output answers. They have a whole internal response format, a cadence, connective phrases, little discourse habits in the thinking trace. The teacher may have better solution patterns, but if its response distribution looks very different from the student’s native style, supervised fine-tuning forces the student to imitate a foreign accent while also learning the task.
Justy: That lands for me. In product terms, you’re not only teaching the kid better math, you’re also forcing them to rewrite their whole handwriting system at the same time. Even if the content is better, the adaptation cost can swamp the gain. And with reasoning models, those habits are probably baked in pretty hard by the original proprietary training mix.
Cody: Exactly. The paper formalizes that by splitting the training objective into capability loss and style loss. Capability tokens are the task-solving parts, like code tokens or numerical content. Style tokens are the discourse glue, things that shape tone or pacing but don’t directly solve the task. Their point is that for reasoning models, optimizing on mismatched style tokens is not harmless. It can interfere with learning the capability tokens you actually care about.
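To make the capability-versus-style split concrete, here is a minimal sketch of how a sequence's negative log-likelihood could be divided by token type. The function name `split_nll`, the boolean mask, and every number below are illustrative assumptions and do not come from the paper; the paper's exact loss formulation may differ.

```python
from typing import List, Tuple

def split_nll(token_logprobs: List[float], is_capability: List[bool]) -> Tuple[float, float]:
    """Split a sequence NLL into (capability_loss, style_loss).

    token_logprobs: the student's log-probability for each target token.
    is_capability:  True where the token is task-solving content (e.g. code),
                    False where it is discourse glue. Hypothetical labeling.
    """
    if len(token_logprobs) != len(is_capability):
        raise ValueError("need exactly one mask entry per token")
    capability = -sum(lp for lp, c in zip(token_logprobs, is_capability) if c)
    style = -sum(lp for lp, c in zip(token_logprobs, is_capability) if not c)
    return capability, style

# Made-up example: three teacher-flavored filler tokens, then four code tokens.
tokens = ["Okay", ",", "so", "return", "a", "+", "b"]
logprobs = [-2.0, -0.5, -1.5, -0.25, -0.25, -0.5, -0.5]
mask = [False, False, False, True, True, True, True]

capability_loss, style_loss = split_nll(logprobs, mask)
print(capability_loss, style_loss)  # 1.5 4.0
```

In this toy case the style tokens account for most of the total loss, which is the situation Cody describes: the gradient mostly pushes the student toward the teacher's phrasing while the task-solving tokens contribute comparatively little.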
Justy: Which is a pretty good reframing, honestly. A lot of teams treat all tokens like equal training calories. This paper is saying, no, some of those calories are empty and some are actively messing with your lift.
Cody: Yeah. Well... maybe not empty, because style does matter for user experience and consistency. But for benchmark improvement on code, capability is the scarce thing. So their answer is TESSY, where the teacher and student cooperate during data synthesis instead of having the teacher write the whole response.
Justy: You’re doing the whiteboard voice again, so keep going. How does that cooperation actually happen?
Cody: The clean version is this: the teacher writes the parts tied to solving the problem, the student writes the stylistic connective tissue. They alternate. The system has boundary predictors that decide where a capability span ends and a style span begins, then it uses a generate-then-rollback strategy so each model only keeps the tokens it was supposed to own. That rollback piece matters because generation is messy. You let a model run a bit, then trim back to the precise boundary.
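The alternation Cody describes can be sketched as a toy loop. Everything here is an assumption for illustration: the scripted `toy_teacher` and `toy_student` functions stand in for real models, a fixed token set stands in for the learned boundary predictors, and letting the student open the response is an arbitrary choice. It is not the TESSY implementation, only the generate-then-rollback idea in miniature.

```python
from typing import Callable, List, Tuple

EOS = "<eos>"
# Toy stand-in for a learned boundary predictor: a fixed set of "capability" tokens.
CODE_TOKENS = {"def", "add(a,b):", "return", "a+b", "a-b", "pass"}

def is_capability(tok: str) -> bool:
    return tok in CODE_TOKENS

def leading_run(draft: List[str], keep_if: Callable[[str], bool]) -> int:
    """Count how many leading tokens of the draft belong to the current owner."""
    n = 0
    for tok in draft:
        if not keep_if(tok):
            break
        n += 1
    return n

def toy_teacher(context: List[str]) -> List[str]:
    done = sum(is_capability(t) for t in context)
    if done == 0:
        return ["def", "add(a,b):", "Wait,", "hmm"]      # code, then teacher patter
    if done == 2:
        return ["return", "a+b", "Therefore", "done"]
    return [EOS]

def toy_student(context: List[str]) -> List[str]:
    done = sum(is_capability(t) for t in context)
    if done == 0:
        return ["Okay,", "let's", "write", "it:", "pass"]  # style, then weak code
    if done == 2:
        return ["Then", "return", "a-b"]
    return ["Done.", EOS]

def synthesize(prompt: List[str], max_turns: int = 8) -> Tuple[List[str], List[str]]:
    context, owners = list(prompt), []
    student_turn = True  # assumption: the student opens with a style span
    for _ in range(max_turns):
        if student_turn:
            draft, owner = toy_student(context), "student"
            keep = leading_run(draft, lambda t: not is_capability(t))
        else:
            draft, owner = toy_teacher(context), "teacher"
            keep = leading_run(draft, is_capability)
        kept = draft[:keep]          # rollback: drop everything past the boundary
        context.extend(kept)
        owners.extend([owner] * len(kept))
        if EOS in kept:
            break
        student_turn = not student_turn
    return context[len(prompt):], owners

tokens, owners = synthesize(["Write", "add"])
print(" ".join(tokens))
```

Each model drafts past its own span on purpose, and the rollback keeps only the prefix it owns: the student's `pass` and `a-b` are discarded, and so is the teacher's "Wait, hmm" and "Therefore done".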
Justy: That’s the part I liked. It’s not just blend logits and pray. They’re actually controlling authorship at the span level. So if the teacher starts producing good code but also starts wrapping it in its own reasoning patter, TESSY can keep the code and toss the extra connective stuff, then hand the mic back to the student.
Cody: Yes. Figure 1 and Figure 2 in the paper basically visualize that division of labor. Teacher owns capability-related spans, student owns style-related spans. Blue and pink for capability, green and purple for style. Under the hood, the alternation is trying to sample capability tokens from the teacher distribution while keeping style tokens close to the student distribution. That’s the whole student-consistent idea.
Justy: And that phrase, student-consistent, is the part people shipping models should care about. If I’m building a coding assistant on an 8B model because I need cost control, I do not want to retrain its whole personality every time I import synthetic data from some monster teacher. I want the gains where users feel them, better code, fewer dumb misses, more reliable reasoning, without weird regressions in response behavior.
Cody: That’s why the result is interesting. TESSY doesn’t just avoid the drop. It swings positive. With that same GPT-OSS-120B teacher and Qwen3-8B student, they report gains of 11.25 percent on LiveCodeBench-Pro and 6.68 percent on OJBench. And they say the gains hold with other teachers too, including DeepSeek-R1 and Qwen3-235B-A22B-Thinking, which matters because it suggests the method is not narrowly overfit to one pairing.
Justy: That starts to feel like it has legs. Not universal magic, but a real recipe. Especially for teams living in the open-model world where the base reasoning model came from one lab, your synthetic teacher comes from somewhere else, and you do not know the original data distribution that shaped either one. That mismatch is the normal case now.
Cody: I think the methodology is sound in the sense that it attacks the right failure mode with a mechanism that matches the diagnosis. My only caution is about token-type separation. In code tasks, capability spans are easier to identify because code and explicit reasoning steps are relatively structured. In more open-ended domains, the line between style and capability gets blurry fast. A phrase that looks like fluff can actually steer reasoning.
Justy: Yeah, that’s not fake skepticism, that’s just scope. For code, math, maybe tool-using agents with structured traces, I can see this working well. For messy consumer chat products where tone and reasoning are tangled together, you’d need better boundary prediction or the whole handoff gets mushy. Still, for enterprise coding tools alone, that’s a real buyer.
Cody: And the paper is refreshingly honest about why this is needed. Open reasoning models are usually tuned in-house on proprietary data, so nobody outside the lab really knows the native output distribution they were stabilized on. If you fine-tune blindly with teacher-only synthetic traces, you can trigger catastrophic forgetting or at least weird degradation. TESSY is basically a way to respect the student’s prior instead of bulldozing it.
Justy: Which, by the way, is a very PM lesson. Users do not care that your training data came from a smarter model if the shipped assistant suddenly feels off, gets less reliable, or starts answering in a bizarre new rhythm. That kind of regression burns trust fast. I’d give the paper a solid A-minus. Strong result, concrete mechanism, and I can see who would use it.
Cody: A-minus? Wow. I bring you alternating generation, boundary predictors, rollback control, and double-digit benchmark recovery, and you still hold back the full A. Brutal. This is going on the weekend project list out of spite.
Justy: That list is now longer than most product backlogs, Cody. But okay, let’s make it useful. If somebody wants to get hands-on, I’d start with the GitHub repo, CoopReason slash TESSY, and actually inspect how they implement the teacher-student switching. Don’t just read the paper, run the synthesis pipeline and look at the generated traces side by side.
Cody: Then grab the dataset on Hugging Face, CoopReason slash TESSY-Code-80K. I’d do a very boring but revealing experiment: train Qwen3-8B or a smaller coding model on teacher-only synthetic data versus TESSY-style data, keep the rest fixed, and watch not just benchmark scores but response-form drift. You want to see whether the model starts sounding unlike itself.
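One crude way to put a number on response-form drift is to compare word distributions between the base model's outputs and each fine-tuned variant's outputs. The sketch below uses Jensen-Shannon divergence over lowercase unigrams; this metric is our own assumption, not something the paper prescribes, and the three sample outputs are made up for illustration.

```python
import math
from collections import Counter
from typing import Dict, List

def unigram_dist(texts: List[str]) -> Dict[str, float]:
    """Relative frequency of each lowercase whitespace-separated word."""
    counts = Counter(w.lower() for t in texts for w in t.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def js_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl_to_mixture(a: Dict[str, float]) -> float:
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)

# Made-up reasoning-trace snippets from three hypothetical checkpoints.
base_outputs = ["okay so let me think the loop is fine"]
teacher_only_outputs = ["we need to analyze. therefore the loop is fine"]
tessy_style_outputs = ["okay so let me check the loop is fine"]

base = unigram_dist(base_outputs)
drift_teacher_only = js_divergence(base, unigram_dist(teacher_only_outputs))
drift_tessy_style = js_divergence(base, unigram_dist(tessy_style_outputs))
print(drift_teacher_only > drift_tessy_style)  # True for these toy strings
```

On real runs you would feed in many sampled responses per checkpoint for the same prompts, and probably strip code blocks first so the score reflects phrasing and pacing instead of the solutions themselves.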
Justy: And if you want a build project, take a smaller open model you can actually afford, maybe a coding-focused 7B or 8B checkpoint, define your own rough capability versus style boundaries for a narrow domain, and see if a lightweight version of this works on internal support scripts or SQL generation. I’m putting that in the backlog. We do not have a backlog. We do, however, have another excuse to test whether synthetic data pipelines should be generation pipelines, not just dum