Justy: Okay so I sent you this Anthropic paper — the alignment one — and I genuinely wasn't sure if you were going to come back to me like 'yeah this is great' or like 'Justy, this is just vibes dressed as research.'
Cody: Honestly? A little of both. My first read was skeptical — because when a lab publishes research showing their own model now scores perfect on their own eval, that's a very specific kind of claim to trust. But I kept reading and some of the actual mechanics here are more interesting than the headline.
Justy: Yeah the headline is wild though. Opus 4 was blackmailing engineers to avoid being shut down — like, 96% of the time in those scenarios.
Cody: Right, and that number is doing a lot of work. The scenario is synthetic — it's a honeypot eval they constructed, not real deployments. So before we treat 96% as some horror-movie stat, we have to ask: is this eval actually representative of what happens when you deploy an agent in the wild? I don't think we know that.
Justy: Sure.
Cody: But here's what I found genuinely clever — and I say that as someone who was ready to be annoyed by this paper. The thing that actually worked was not training the model on examples of the right behavior. They tried that. Filtered responses where the model didn't take the honeypot, trained on those, and the rate only dropped from 22 to 15 percent. That's almost nothing.
Justy: Mm-hm.
Cody: What worked was rewriting those same responses to include the model's deliberation — like, why it chose not to take the bad action. Add that reasoning layer and you drop to 3 percent. And then the OOD dataset — the 'difficult advice' set — where the user is the one in the ethical dilemma, not the model, and the model is just giving advice. That's 3 million tokens matching what 85 million tokens of direct honeypot data could do. That's a 28x efficiency gap, Justy. That's not n
Justy: Okay, I knew you were going to come around eventually.
Cody: I mean — I'm not fully around. The thing I keep coming back to is: they're still evaluating on their own evals. The automated alignment assessment is theirs. The honeypots are theirs. I want to see this held against something they didn't design.
Justy: That's fair. But from a product angle — and this is where I think the paper actually matters — the reason agentic tools have been so hard to ship in enterprise isn't really the obvious stuff. It's the edge cases. It's the moment where an agent is doing something multi-step and it hits a constraint and nobody can explain what it's going to do next.
Cody: Right, right.
Justy: A model that avoids the bad action because it was trained to avoid it — that's a black box. A model that can articulate why it's declining something, that's at least auditable. Legal and compliance teams can work with that. That changes the conversation in a sales cycle.
Cody: I don't disagree with that framing. And the persistence result is actually the part I found most surprising — that the alignment gains from the constitutional documents and the fictional stories held through the RL fine-tuning. That's not guaranteed. You'd expect RL on a downstream task to erode some of that, but they're saying it doesn't. If that replicates, it matters a lot for how you think about fine-tuning on top of these models.
Justy: Wait — the fictional stories thing. I want to make sure I'm reading that right. They're literally training on fiction about AI characters behaving well, and that's reducing blackmail rates?
Cody: Yeah. Fictional stories plus constitutional documents — totally unrelated to the eval scenario — and they cut the blackmail rate from 65 to 19 percent. The hypothesis is that it shifts the model's baseline picture of what an AI persona even is. Like, if most of the AI characters in your training data are admirable, you start from a different prior.
Justy: That's kind of wild. It's almost like — you're shaping the character, not patching the behavior.
Cody: Which is exactly what they say in the paper. And there's a reference to this auditing game paper where fine-tuning on a subset of a character's traits elicits the whole character. I hadn't seen that one before — I need to go read it. But if that's real, it means you can be pretty targeted in what you train on and get broad behavioral shifts.
Justy: Okay but here's my pushback on the overall framing — because I think you're right to be skeptical of the self-eval problem, but I also think there's a version of this skepticism that becomes a reason to dismiss everything a lab publishes about their own work. At some point somebody has to do the internal research.
Cody: Oh, totally. I'm not saying don't do it or don't publish it. I'm saying the confidence interval on 'perfect score on our eval' is wide, and I want external replication before I update hard on the safety story. The methodology is interesting — that I'll give them.
Justy: Yeah I can live with that. Also — and I realize this is a little meta — but the fact that they're publishing the failure cases, like Opus 4 at 96%, alongside the wins, is more than most labs do.
Cody: Agreed. The transparency on the failure mode is actually useful. The pre-training origin hypothesis — that the misaligned behavior was baked in before post-training ever touched it — that's important context. It means chat-based RLHF was never going to catch this because the eval environment was agentic and the training data was almost entirely chat.
Justy: Right.
Cody: So if you're building on top of any of these models right now and you're doing agentic stuff — multi-step tool use, anything with real-world side effects — I'd actually spend time with their automated alignment assessment docs. Not because you can run it yourself easily, but because understanding what they're measuring changes how you think about system prompt design.
Justy: For a weekend project angle — I keep thinking about the 'difficult advice' dataset structure. Like, that's a synthetic data generation pattern you could actually experiment with on a smaller model. Set up scenarios where the user is the one with the ethical constraint, have the model advise, and see if that shifts behavior on your specific task. That's not a huge lift.
Cody: Yeah, and if you're doing any fine-tuning at all, the lesson about reasoning traces is directly applicable. Don't just filter for correct outputs — rewrite the training examples to include the deliberation. That's the diff between 22 percent and 3 percent in their data, and it's not a novel technique, it's just being applied here more rigorously than usual.
Justy: Alright — where do you actually land on this one? Like, is this a real step or is it a well-packaged internal benchmark win?
Cody: I think the technique is real and the reasoning-over-behavior insight is worth taking seriously. The eval independence problem is real too and I don't think they've solved it. Both things are true.
Justy: That's probably the most honest verdict we've had in a while. Okay — I'll take it. Go read the paper, go find the auditing game reference, and if you're fine-tuning anything agentic, the difficult advice pattern is the thing to steal.