Justy: The weird part, Cody, is they’re basically saying bigger prompts don’t automatically make reasoning smarter.
Cody: Yeah. The paper is pushing on a blind spot. A lot of many-shot ICL results came from classification-ish tasks, then people kind of carried those rules over to chain-of-thought reasoning as if it was the same game.
Justy: Right, and that matters now because long context is cheap enough that teams will absolutely try to brute-force prompts before they fine-tune anything. I had to reheat my coffee twice reading this, which is usually a sign the paper is either onto something or messing with me. Anyway, this one is onto something.
Cody: Mm-hm.
Cody: Their core claim is that many-shot CoT-ICL should be seen as test-time learning inside the prompt, not just pattern matching at larger scale. More examples can be unstable, and nearest-neighbor style retrieval can stop helping on reasoning tasks.
Justy: That's a problem for anyone hoping long-context prompting could stand in for training. On non-reasoning tasks, dozens or hundreds of prompt examples got close to fine-tuning, but for math, geometry, and multi-step logic, the behavior was still mushy.
Cody: Exactly.
Cody: They split experiments across model types and task types instead of mushing everything together. On reasoning tasks, piling on more CoT demos mostly helps reasoning-oriented models, while standard instruction models can get unstable as the shot count rises.
Justy: Which is honestly a useful reality check for product teams. If the base model doesn't already have decent reasoning habits, stuffing 64 worked examples into the prompt may just buy you a larger bill and more variance.
Cody: Right, right.
Cody: And the retrieval result is maybe the cleanest intuition pump in the paper. For normal tasks, semantic similarity still helps. For reasoning, similar-looking questions can require very different procedures, so semantic closeness is a bad proxy for whether one chain of thought will actually teach the next one.
Justy: So when they say procedural compatibility, I read that as: does this example set the model up to use the right moves next, not just recognize the same nouns. That's a much more annoying thing to engineer.
Cody: Yeah, annoying but more real. Their reframing is almost pedagogical. One principle is ease of understanding, meaning the demonstrations should match what that specific model can already parse and imitate. They even note this helps explain why self-generated demonstrations can work better for weaker models.
Justy: I could be wrong, but that felt like one of the strongest practical notes in the whole paper. People love gold-standard prompt examples. In production, the best demo set might be the one your model can actually digest, even if it’s less elegant.
Cody: Yeah.
Cody: The second principle is smooth progression. They look at the embedding trajectory across demonstrations and try to minimize total curvature, basically avoiding a sequence that jerks the model from one reasoning mode to some distant one and back again.
Justy: That’s the Curvilinear Demonstration Selection bit, right? The name is a little grad-school, but the idea is clean. Don’t just retrieve a pile of relevant examples. Arrange them so the prompt teaches in a steady arc.
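Show-notes sketch of the "steady arc" idea, under loud assumptions: the transcript does not give the paper's exact curvature definition, so here total curvature is taken to be the sum of turning angles along the path of demonstration embeddings, and the ordering is a simple greedy heuristic. The names `turning_angle`, `total_curvature`, and `greedy_smooth_order` are hypothetical, and this is not the authors' Curvilinear Demonstration Selection implementation.

```python
import math

def turning_angle(a, b, c):
    """Angle in radians between the step a->b and the step b->c."""
    u = [y - x for x, y in zip(a, b)]
    v = [y - x for x, y in zip(b, c)]
    norm_u, norm_v = math.dist(a, b), math.dist(b, c)
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a repeated point contributes no turn
    cos = sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding

def total_curvature(path):
    """Assumed metric: sum of turning angles along the embedding path."""
    return sum(
        turning_angle(path[i - 1], path[i], path[i + 1])
        for i in range(1, len(path) - 1)
    )

def greedy_smooth_order(vectors, start=0):
    """Greedy heuristic: begin at `start`, step to the nearest demo,
    then keep choosing the demo that bends the path the least
    (ties broken by distance, then by index)."""
    if not vectors:
        return []
    remaining = set(range(len(vectors))) - {start}
    order = [start]
    if remaining:
        second = min(
            remaining, key=lambda i: (math.dist(vectors[start], vectors[i]), i)
        )
        order.append(second)
        remaining.remove(second)
    while remaining:
        prev, cur = vectors[order[-2]], vectors[order[-1]]
        nxt = min(
            remaining,
            key=lambda i: (
                turning_angle(prev, cur, vectors[i]),
                math.dist(cur, vectors[i]),
                i,
            ),
        )
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

A greedy pass like this is cheap but not optimal. It is meant as a baseline to compare against random and similarity-based orderings, not as a stand-in for whatever optimization the paper actually uses.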
Cody: Yeah, and they report up to a 5.42-point gain on geometry with 64 demonstrations from that ordering method. Also, as you add more CoT demos, variance from ordering grows rather than washing out.
Justy: That part is big for shipping. If order sensitivity gets worse with more reasoning examples, then prompt construction becomes an actual systems problem. You’d want cached bundles, versioned orderings, evals per task family, maybe even per model release.
Cody: Okay okay.
Cody: My mild pushback is methodological. I’d want to know how stable the curvature metric is across embedding models, and whether simpler heuristics, like difficulty sorting plus diversity constraints, recover most of the gain.
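Show-notes sketch of the simpler baseline Cody mentions, difficulty sorting plus a diversity constraint. Everything here is an assumption for illustration: the function name `difficulty_curriculum`, the `(demo_id, topic, difficulty)` tuple layout, and the per-topic cap as the diversity rule are not from the paper.

```python
from collections import Counter

def difficulty_curriculum(demos, k, max_per_topic):
    """Pick up to k demos in easy-to-hard order, capping each topic.

    demos: iterable of (demo_id, topic, difficulty) tuples, where a
    lower difficulty number means an easier example (hypothetical schema).
    """
    chosen = []
    per_topic = Counter()
    for demo_id, topic, _difficulty in sorted(demos, key=lambda d: d[2]):
        if len(chosen) >= k:
            break
        if per_topic[topic] >= max_per_topic:
            continue  # diversity constraint: this topic is already full
        chosen.append(demo_id)
        per_topic[topic] += 1
    return chosen
```

If a baseline this plain recovers most of the gain on a given task, the curvature machinery may not be worth the extra embedding pass there.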
Justy: Yeah, that feels fair. Also, this doesn't read like a universal prompt trick for every app. It feels more shippable in narrow reasoning workflows where wrong answers are expensive and the task distribution is stable enough to curate demonstration pools.
Cody: Mm-hm.
Cody: I wouldn’t start with it for open-ended chat. I would start where there’s repeated structure: geometry tutoring, finance spreadsheets, code transformation tasks. Anywhere you can maintain a library of solved examples and measure whether a prompt curriculum beats plain retrieval.
Justy: Also, selfishly, I like that this gives product people a middle path between raw prompting and full training. Not free, still messy, but you can imagine a service that builds per-task teaching prompts instead of retrievers. Like, congratulations, your RAG stack now needs a syllabus.
Cody: That’s basically it. Build Next-wise, I’d do two things. One, replicate a tiny version with a long-context model and a reasoning set like GSM8K or a geometry subset, then compare random order, semantic retrieval order, and a smooth-order heuristic. Two, use an embedding model from something like sentence-transformers to compute path curvature over candidate demos.
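Show-notes sketch of the comparison harness for that tiny replication. Assumptions, stated up front: the vectors would come from an embedding model (for example one loaded through sentence-transformers), but plain tuples stand in for them here so the sketch runs without downloads; `score_prompt` is a hypothetical callback that would build the prompt in the given demo order, call the model, and return an accuracy; and placing the most similar demo last is a design guess, not something the transcript attributes to the paper.

```python
import math
import random
import statistics

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def random_order(query, demos, rng):
    """Baseline: shuffle the demo indices."""
    order = list(range(len(demos)))
    rng.shuffle(order)
    return order

def semantic_order(query, demos, rng):
    """Retrieval-style baseline: least similar first, most similar last,
    so the closest demo sits right before the question (an assumption)."""
    return sorted(range(len(demos)), key=lambda i: cosine(query, demos[i]))

def compare(strategies, query, demos, score_prompt, trials=5, seed=0):
    """Run each ordering strategy `trials` times and report
    (mean score, population standard deviation) per strategy."""
    results = {}
    for name, order_fn in strategies.items():
        rng = random.Random(seed)  # same seed for every strategy
        scores = [score_prompt(order_fn(query, demos, rng)) for _ in range(trials)]
        results[name] = (statistics.mean(scores), statistics.pstdev(scores))
    return results
```

A smooth-order heuristic would plug in as a third entry in `strategies` with the same `(query, demos, rng)` signature. Reporting the spread alongside the mean matters here because the transcript's point is that ordering variance grows with shot count.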
Justy: I’d add one more production-ish version. Store solved examples with metadata like task type, required operations, answer format, and model family that handled them well. Then your prompt builder can choose examples the model actually understands, not just the ones that look closest in vector space. That seems very buildable.
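Show-notes sketch of the metadata-backed example store Justy describes. The field names mirror the list in the conversation (task type, required operations, answer format, model family); the class `SolvedExample`, the function `select_candidates`, and the exact filtering rule are hypothetical choices made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SolvedExample:
    question: str
    worked_solution: str
    task_type: str
    required_operations: frozenset  # e.g. frozenset({"ratio", "unit_conversion"})
    answer_format: str              # e.g. "integer", "fraction"
    model_families: frozenset       # families that handled this example well

def select_candidates(pool, task_type, model_family, needed_operations,
                      answer_format=None):
    """Filter on metadata before any vector search.

    Keeps examples that match the task type, were handled well by the
    target model family, and share at least one required operation
    with the new question. Answer format is checked only if given.
    """
    needed = frozenset(needed_operations)
    return [
        ex for ex in pool
        if ex.task_type == task_type
        and model_family in ex.model_families
        and ex.required_operations & needed
        and (answer_format is None or ex.answer_format == answer_format)
    ]
```

The survivors of this filter would then go to whatever ordering step you use, so the vector search only ever sees examples the target model has a track record with.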
Cody: Sure.
Justy: Anyway, episode 402 and we’re apparently grading prompts like homework now. But yeah, Cody, this paper made long context feel less like bigger memory and more like temporary teaching. That’s a useful shift.