Justy: A model can name the shape and still completely fail the part where you turn it around in your head.
Justy: Welcome to Exploring Next, episode 314. I’m Justy, recording from Cody’s kitchen in DC, slightly over-caffeinated, and today we’re talking about Mind’s Eye.
Cody: Yeah, and this matters now because multimodal models look good on broad vision benchmarks, but a lot of product ideas quietly assume they can do mental rotation, folding, or visual analogy. I think that assumption is still shaky.
Justy: Right. If you’re building anything with diagrams, UI agents, robotics, design tools, even puzzle-like planning, the gap is not academic. It changes what you can trust in production.
Cody: This paper targets that gap. Existing benchmarks cover image description, OCR, and visual QA, but often miss actual visuospatial manipulation or let language priors carry the answer. Mind’s Eye tries to isolate visual thinking itself.
Cody: The core idea is the A-R-T taxonomy: Abstraction, Relation, and Transformation. The tasks draw from classic cognitive tests like mental rotation and paper folding, and use multiple choice plus diagnostic distractors to expose specific failure modes.
Justy: The headline is humans at about 80 percent accuracy and top multimodal models below 50 percent. And the gap lands exactly on tasks people casually assume these systems can do.
Cody: What stood out to me is the difficulty trend. Humans drop as tasks get harder, but models stay kind of flat. That suggests they may not be doing a weaker version of the same operation so much as lacking it, or only having it in a brittle way.
Cody: They also tried prompting interventions. Structured scaffolding helped some Abstraction tasks but hurt Transformation tasks. Language can help with rule extraction, but it does not reliably create a mental workspace for folding or rotating shapes.
Justy: And that is where demos and products diverge. If a system only works with a carefully tuned reasoning prompt on one slice of tasks, that is not a stable capability.
Cody: Methodologically, I also liked that they looked at attention. Models often localize the relevant region, so the failure is not always perception. Sometimes they know where to look and still cannot compute the needed relation.
Justy: So I read this less as a recipe for solved visual reasoning and more as a cleaner eval layer. A better flashlight, not a finished engine.
Cody: As a diagnostic, it is strong. For builders, I’d use it as evaluation infrastructure for model vendors, robotics, visual agents, and diagram-heavy systems. Start with the GitHub benchmark, score each model separately on the Abstraction, Relation, and Transformation categories, and if you want a weekend project, compare performance with and without scaffolding on the Transformation subset.
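Show notes: For anyone trying the weekend project Cody describes, here is a minimal sketch of the bookkeeping, not the benchmark's own tooling. It assumes you have already run a model and logged one record per item as (category, condition, is_correct). That record layout, the condition names "plain" and "scaffold", and the function names are all hypothetical, and the toy rows are made-up placeholders rather than results from the paper.

```python
from collections import defaultdict

# Made-up placeholder rows, NOT real benchmark results.
# Assumed layout (hypothetical): (category, condition, is_correct)
TOY_RESULTS = [
    ("Abstraction", "plain", True),
    ("Abstraction", "plain", False),
    ("Abstraction", "scaffold", True),
    ("Abstraction", "scaffold", True),
    ("Relation", "plain", True),
    ("Relation", "plain", False),
    ("Relation", "scaffold", True),
    ("Relation", "scaffold", False),
    ("Transformation", "plain", True),
    ("Transformation", "plain", False),
    ("Transformation", "scaffold", False),
    ("Transformation", "scaffold", False),
]


def accuracy_by_category(results):
    """Return {(category, condition): accuracy} for the logged items."""
    counts = defaultdict(lambda: [0, 0])  # key -> [num_correct, num_total]
    for category, condition, is_correct in results:
        cell = counts[(category, condition)]
        cell[0] += int(is_correct)
        cell[1] += 1
    return {key: correct / total for key, (correct, total) in counts.items()}


def scaffold_delta(acc):
    """Return {category: scaffold accuracy minus plain accuracy}.

    Categories missing either condition are skipped.
    """
    categories = sorted({category for category, _ in acc})
    return {
        category: acc[(category, "scaffold")] - acc[(category, "plain")]
        for category in categories
        if (category, "scaffold") in acc and (category, "plain") in acc
    }


acc = accuracy_by_category(TOY_RESULTS)
for category, delta in scaffold_delta(acc).items():
    print(f"{category}: scaffold minus plain = {delta:+.2f}")
```

Reporting the delta per category, instead of one pooled score, is what keeps a gain on Abstraction from hiding a loss on Transformation.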
Cody: And if you want one more practical tool, wire it into lm-eval-style reporting or a simple notebook pipeline, then test a vision-language model you can host yourself. Even a small open model will teach you a lot here, maybe more than a glossy demo ever would. [sighs] Also, Justy, next time you fly in, pick a less brutal arrival time.
Justy: Fair. We learned that naming the shape is not the same as turning it around in your head. That’s episode 314 of Exploring Next.