Justy: So I had one of those mornings where I opened my laptop, stared at three tabs, and realized I’d already forgotten what I was supposed to be doing. [chuckles] Anyway, this paper is kind of about that feeling, but for models.
Cody: Yeah, and the annoying part is it’s a real gap. Models can look great when the answer is already somewhere in pretraining, but once the useful stuff lives inside a long doc or a weird internal manual, they start acting like they never saw it. This paper calls that context learning, and I think that framing is right.
Justy: That’s the piece that matters to me. If I’m building something for support, ops, onboarding, whatever, I don’t want to retrain the whole model just because one policy doc changed. I want it to read the room, basically, and use the context like it actually means something.
Cody: Right, and the paper’s claim is pretty specific. Instead of asking a model to magically infer everything from raw context, they turn the context into explicit natural-language skills. So the model gets procedures, rules, little decision habits. That’s a lot closer to how people actually work through dense instructions.
Justy: Okay, but writing those skills by hand sounds like a nightmare. If the context is fifty pages of technical junk, nobody’s sitting there distilling it into perfect prompts all day. [laughs] That already sounds dead on arrival for production.
Cody: That’s exactly the bottleneck they’re trying to dodge. They say there are two big problems: manual skill annotation is too expensive, and automated skill construction usually has no external feedback. So Ctx2Skill sets up a self-evolving loop. A Challenger generates probing tasks and rubrics from the context, a Reasoner tries to solve them using the current skills, and a neutral Judge gives binary feedback.
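A rough sketch of one round of the loop Cody describes, for anyone following along in code. It is not the paper's implementation: `Task`, `Verdict`, `run_round`, and the three toy functions are hypothetical names, and in practice each role would be a model call rather than a string check.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    question: str
    rubric: str  # what a passing answer has to mention


@dataclass
class Verdict:
    task: Task
    answer: str
    passed: bool  # binary feedback from the Judge


def run_round(
    context: str,
    skills: List[str],
    challenger: Callable[[str], List[Task]],
    reasoner: Callable[[str, List[str], Task], str],
    judge: Callable[[Task, str], bool],
) -> List[Verdict]:
    """One pass: Challenger writes tasks, Reasoner answers, Judge scores."""
    verdicts = []
    for task in challenger(context):
        answer = reasoner(context, skills, task)
        verdicts.append(Verdict(task, answer, judge(task, answer)))
    return verdicts


# Toy stand-ins (assumptions for illustration only).
def toy_challenger(context: str) -> List[Task]:
    return [
        Task("What is the refund window?", "30 days"),
        Task("Who approves refunds over $500?", "manager"),
    ]


def toy_reasoner(context: str, skills: List[str], task: Task) -> str:
    # This toy Reasoner can only repeat what its current skills say.
    return " ".join(skills) if skills else "unknown"


def toy_judge(task: Task, answer: str) -> bool:
    return task.rubric in answer


verdicts = run_round(
    "Refund policy document text goes here.",
    ["Refunds are allowed within 30 days."],
    toy_challenger,
    toy_reasoner,
    toy_judge,
)
print([v.passed for v in verdicts])  # [True, False]
```

The failed second task is the kind of signal the rest of the loop would feed on.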
Justy: So the model is kind of testing itself against its own homework. Which, honestly, feels very on brand for research. But I get why that’s useful if the context is changing fast and humans can’t keep labeling edge cases.
Cody: Yeah, but the neat part is they don’t stop at just one loop. When the Reasoner fails, a Proposer and a Generator look at those failures and synthesize new skill updates in plain language. And they do that for both sides, meaning the Challenger also evolves. So the task generation and the skill set co-adapt instead of one staying fixed while the other gets stronger.
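The failure-to-skill step can be sketched roughly like this. The function names are hypothetical, the string templates stand in for model calls, and one piece is a guess: the episode does not say what signal the Challenger evolves from, so this sketch assumes it learns from tasks the Reasoner passed too easily.

```python
from typing import List, Tuple

Outcome = Tuple[str, str, bool]  # (question, rubric, passed)


def toy_proposer(failed: List[Tuple[str, str]]) -> List[str]:
    """Turn each failure into a plain-language suggestion (stand-in for a model)."""
    return [f"When asked '{q}', make sure the answer covers: {r}" for q, r in failed]


def toy_generator(skills: List[str], proposals: List[str]) -> List[str]:
    """Merge proposals into the skill set, skipping exact duplicates."""
    merged = list(skills)
    for proposal in proposals:
        if proposal not in merged:
            merged.append(proposal)
    return merged


def evolve_both(
    reasoner_skills: List[str],
    challenger_skills: List[str],
    outcomes: List[Outcome],
) -> Tuple[List[str], List[str]]:
    failed = [(q, r) for q, r, ok in outcomes if not ok]
    passed = [(q, r) for q, r, ok in outcomes if ok]
    # Reasoner side: learn from what went wrong.
    new_reasoner = toy_generator(reasoner_skills, toy_proposer(failed))
    # Challenger side (assumption): learn from tasks that were too easy.
    too_easy = [f"Probe beyond: {q}" for q, _ in passed]
    new_challenger = toy_generator(challenger_skills, too_easy)
    return new_reasoner, new_challenger


outcomes = [
    ("What is the refund window?", "30 days", True),
    ("Who approves refunds over $500?", "manager", False),
]
reasoner_skills, challenger_skills = evolve_both(
    ["Refunds are allowed within 30 days."], [], outcomes
)
print(reasoner_skills[-1])
print(challenger_skills)
```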
Justy: That’s the part that sounds shippable to me, weirdly enough. Not the whole research stack, obviously, but the idea of an inference-time layer that keeps extracting better procedures from the same docs. If I’m shipping a product, I’d rather do that than ask for a full fine-tune every time the source material moves.
Cody: Same. Though I do think there’s a real risk of the loop getting too clever for its own good. If the Challenger keeps making nastier and nastier cases, you can end up optimizing for extreme failures instead of the actual distribution. The paper calls that adversarial collapse, and I think that’s a real concern, not hand-wavy skepticism.
Justy: Yeah, because then you’ve built a system that’s brilliant at its own torture chamber. [chuckles] Not exactly what I want sitting behind a customer-facing workflow.
Cody: Exactly. Their answer is Cross-time Replay. My read is that they keep track of skill sets over time and pick the one that balances performance on representative cases, not just the most aggressive recent failures. So instead of letting the system drift toward weird specialization, they try to recover the skill set that looked best across a broader slice of the problem.
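Cody's reading of Cross-time Replay can be sketched as a selection step over saved skill sets. This follows his description, not the paper's actual procedure: `select_by_replay` and `toy_passes` are hypothetical, and scoring by plain pass rate on a pooled task set is an assumption.

```python
from typing import Callable, List, Tuple


def select_by_replay(
    snapshots: List[List[str]],
    replay_tasks: List[str],
    passes: Callable[[List[str], str], bool],
) -> Tuple[int, float]:
    """Return (index, pass rate) of the snapshot that does best on replayed tasks.

    Ties go to the earliest snapshot, since only a strictly better rate wins.
    """
    if not replay_tasks:
        raise ValueError("replay_tasks must not be empty")
    best_index, best_rate = 0, -1.0
    for i, skills in enumerate(snapshots):
        rate = sum(passes(skills, task) for task in replay_tasks) / len(replay_tasks)
        if rate > best_rate:
            best_index, best_rate = i, rate
    return best_index, best_rate


def toy_passes(skills: List[str], task: str) -> bool:
    # Stand-in for "Reasoner answers, Judge scores": a skill must mention the rubric.
    return any(task in skill for skill in skills)


snapshots = [
    ["Refund window is 30 days."],
    ["Refund window is 30 days.", "A manager approves large refunds."],
    ["Special handling for leap-year refunds."],  # drifted toward one edge case
]
replay_tasks = ["30 days", "manager", "leap-year"]  # pooled from several rounds

index, rate = select_by_replay(snapshots, replay_tasks, toy_passes)
print(index)  # 1
```

Here the latest snapshot nails the newest edge case but loses the common ones, so the middle snapshot is the one that gets kept.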
Justy: That’s a nice product instinct, honestly. Because in the real world, the weird edge case is always screaming the loudest, and meanwhile the boring common case pays the bills. If the system only gets sharper on synthetic hard mode, you’ve probably lost the plot.
Cody: And the results are at least believable, which matters. On CL-bench, they report gains on four context learning tasks, and the numbers move across different backbones. GPT-4.1 goes from 11.1 to 16.5 percent, and GPT-5.1 goes from 21.2 to 25.8 percent. Not magic, but enough to say the skills are doing something real.
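To put those two pairs of numbers in perspective, here is the arithmetic for absolute and relative gains. Only the four figures quoted in the episode are used; the `gains` helper is just an illustrative name.

```python
# Before/after scores (percent) as quoted in the episode.
results = {"GPT-4.1": (11.1, 16.5), "GPT-5.1": (21.2, 25.8)}


def gains(before: float, after: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent), rounded to 1 decimal."""
    absolute = round(after - before, 1)
    relative = round(100 * (after - before) / before, 1)
    return absolute, relative


for name, (before, after) in results.items():
    points, relative = gains(before, after)
    print(f"{name}: +{points} points, {relative}% relative")
# GPT-4.1: +5.4 points, 48.6% relative
# GPT-5.1: +4.6 points, 21.7% relative
```

So the absolute bumps are similar, but the relative lift is much larger for the backbone that started lower.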
Justy: Do you think this is research-only, or could somebody actually bolt it into a product? Because I’m hearing a lot of moving parts, but not necessarily a science project that dies in a notebook.
Cody: I think it can be both. The full multi-agent loop is definitely researchy, because you need the Challenger, Reasoner, Judge, and the failure-to-skill machinery. But the output is just skills in natural language, which means you can plug them into any model at inference time. That part is practical. I’d probably start with a narrow domain, like support macros or internal troubleshooting, and see whether the generated skills improve consistency before I touched anything broader.
Justy: That’s the version I’d try too. Also, this is one of those days where my coffee is doing more work than I am, so I’m trying to keep the bar realistic. [sighs] But yeah, a weekend build could be as simple as taking a small docs set, generating probing questions, and seeing whether explicit skills beat raw context stuffing.
Cody: Totally. If I were hacking on it solo, I’d use a repo with a retrieval eval harness, a small corpus like product docs or API docs, and a script that alternates between task generation, answer attempts, and failure summarization. Even without the full paper’s agents, you could prototype the basic idea: extract procedures from failures, then re-run the same tasks with those procedures injected.
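For the weekend-build version, the comparison Justy mentions (explicit skills versus raw context stuffing) could be scored with a harness along these lines. Everything here is a placeholder: `raw_answer` and `skilled_answer` are canned stand-ins for real model calls, and the two tasks are invented, so the printed numbers say nothing about real performance.

```python
from typing import Callable, Dict, List, Tuple

Task = Tuple[str, str]  # (question, rubric)


def pass_rate(
    tasks: List[Task],
    answer_fn: Callable[[str], str],
    judge_fn: Callable[[str, str], bool],
) -> float:
    """Fraction of tasks whose answer satisfies the rubric."""
    return sum(judge_fn(rubric, answer_fn(q)) for q, rubric in tasks) / len(tasks)


def compare(
    tasks: List[Task],
    raw_fn: Callable[[str], str],
    skilled_fn: Callable[[str], str],
    judge_fn: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Run the same tasks through both setups so the scores are comparable."""
    return {
        "raw": pass_rate(tasks, raw_fn, judge_fn),
        "skills": pass_rate(tasks, skilled_fn, judge_fn),
    }


# Placeholder "models" (assumptions): swap in real calls that prompt with
# the raw docs versus the docs plus the generated skills.
def raw_answer(question: str) -> str:
    return "Refunds are allowed within 30 days."


def skilled_answer(question: str) -> str:
    if "approves" in question:
        return "A manager must approve refunds over $500."
    return "Refunds are allowed within 30 days."


def judge(rubric: str, answer: str) -> bool:
    return rubric.lower() in answer.lower()


tasks = [
    ("What is the refund window?", "30 days"),
    ("Who approves refunds over $500?", "manager"),
]
scores = compare(tasks, raw_answer, skilled_answer, judge)
print(scores)  # {'raw': 0.5, 'skills': 1.0}
```

Keeping the task list fixed across both runs is the important part; otherwise a change in score could come from different questions, not from the skills.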
Justy: And if it works at all, that’s already useful. Because the real win here isn’t some grand model personality shift. It’s getting from messy context to a reusable playbook without making a human write the playbook every time.
Cody: Yeah, that’s the cleanest version of it. The paper’s basically arguing that context doesn’t have to stay raw. You can compress it into skills, keep refining them through feedback, and make the model a lot better at the exact kind of work companies keep asking it to do.
Justy: Alright, that’s the one I’d keep an eye on. Context into skills, skills back into the model, and hopefully fewer moments where the thing looks at the doc and just shrugs. [laughs]