Justy: When a model answers an image question with a wall of text, I always feel like, okay, but can I actually check that?
Justy: You’re listening to Exploring Next, episode 339. Today we’re digging into SketchVLM, which basically gives vision models a way to point at the image instead of just talking at it.
Cody: Yeah, and I think that matters right now because these assistants are moving into browser stuff, office stuff, everyday workflows. If the answer is about a photo, a chart, a screen, people don’t just want a conclusion. They want the model to show its work.
Justy: Exactly. And humans already do this. We circle things, underline stuff, draw arrows. That’s just a way easier trust check than reading a paragraph and hoping the model saw the same thing you did.
Cody: SketchVLM is trying to make that native to the model. The paper’s approach is training-free and model-agnostic, so you can wrap it around systems like Gemini-3-Pro-Preview or GPT-5 without retraining the backbone.
Justy: That part feels important for production. Because if I’m building an app, I don’t want to wait for a custom fine-tune just to get a better explanation layer.
Cody: Right. The output is a separate SVG overlay on top of the original image. So instead of editing pixels, it draws annotations in another layer. That means the source image stays intact, which is a big trust thing. If the model is wrong, at least it didn’t quietly rewrite the evidence.
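Show notes: a minimal sketch of the non-destructive overlay idea Cody describes, not the paper's implementation. It assumes the image is only referenced from the SVG and that annotations live in their own group. The function name `build_overlay`, the file name `photo.jpg`, and the pixel-box format are made up for illustration.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # serialize with a default xmlns, no prefix


def build_overlay(image_href, width, height, boxes):
    """Return SVG markup that references the image and draws boxes in a separate layer.

    boxes is a list of (x, y, w, h) tuples in pixels (an assumed format).
    """
    svg = ET.Element(f"{{{SVG_NS}}}svg", {
        "width": str(width),
        "height": str(height),
        "viewBox": f"0 0 {width} {height}",
    })
    # The source image is referenced, never modified: the pixels stay as they were.
    ET.SubElement(svg, f"{{{SVG_NS}}}image", {
        "href": image_href, "width": str(width), "height": str(height),
    })
    # All annotations go into one group, so the whole layer can be hidden or replaced.
    layer = ET.SubElement(svg, f"{{{SVG_NS}}}g", {
        "id": "annotations", "fill": "none", "stroke": "red", "stroke-width": "3",
    })
    for x, y, w, h in boxes:
        ET.SubElement(layer, f"{{{SVG_NS}}}rect", {
            "x": str(x), "y": str(y), "width": str(w), "height": str(h),
        })
    return ET.tostring(svg, encoding="unicode")


overlay = build_overlay("photo.jpg", 640, 480, [(100, 120, 80, 60)])
print(overlay)
```

Because the image is a reference and the annotations are a sibling group, deleting the `annotations` group gives back the untouched original.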
Justy: So it’s more like a markup pass than an image generation pass.
Cody: Yeah. The paper frames it as visually grounding the answer. The model can label parts, connect dots, draw shapes around objects, even sketch trajectories in some of the reasoning tasks. They test across seven benchmarks, including maze navigation, ball-drop prediction, object counting, part labeling, and a few drawing-style tasks.
Justy: And the results are pretty strong, right? I saw the claim about up to plus 28.5 percentage points on visual reasoning.
Cody: That’s the headline, yeah. And annotation quality goes up by as much as 1.48 times versus image-editing and fine-tuned sketching baselines. I think the interesting part is they also say the annotations are more faithful to the model’s stated answer. That’s the bit I’d care about if this shipped.
Justy: Because if the overlay and the text disagree, users are gonna notice fast.
Cody: Exactly. The mechanism seems to be a draft-and-refine style setup. In single-turn mode, the model generates the answer and the visual annotations in one shot, and that already works surprisingly well. Then multi-turn lets the user or system push back and refine the sketch, which opens up collaboration instead of just one-and-done output.
Justy: That feels a lot closer to a real product than a one-off demo. Like, if I’m checking a car’s oil level or asking what part of a diagram matters, I can see a loop where the system highlights something, then I ask it to adjust or explain a different region.
Cody: Yeah, and the non-destructive part is what makes that plausible. If you’re layering editable SVG, you can nudge a circle, move an arrow, remove a label. That’s much easier than trying to regenerate an image every time the interaction changes.
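Show notes: a rough illustration of why an editable SVG layer makes multi-turn tweaks cheap. It assumes each annotation carries an `id` attribute; the overlay markup, the ids, and the helper names `nudge_circle` and `remove_by_id` are hypothetical and not taken from the paper's code.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)

# Hypothetical overlay for the oil-level example: one circle and one label.
OVERLAY = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 640 480">
  <image href="photo.jpg" width="640" height="480"/>
  <g id="annotations">
    <circle id="dipstick" cx="200" cy="150" r="30" fill="none" stroke="red"/>
    <text id="label-1" x="240" y="150">oil level</text>
  </g>
</svg>"""


def nudge_circle(root, element_id, dx, dy):
    """Shift the circle with the given id by (dx, dy); return True if it was found."""
    for circle in root.iter(f"{{{SVG_NS}}}circle"):
        if circle.get("id") == element_id:
            circle.set("cx", str(float(circle.get("cx")) + dx))
            circle.set("cy", str(float(circle.get("cy")) + dy))
            return True
    return False


def remove_by_id(root, element_id):
    """Remove the first element with the given id; return True if it was found."""
    for parent in root.iter():
        for child in list(parent):
            if child.get("id") == element_id:
                parent.remove(child)
                return True
    return False


root = ET.fromstring(OVERLAY)
nudge_circle(root, "dipstick", 15, -10)  # move the circle right and up
remove_by_id(root, "label-1")            # drop the label
edited = ET.tostring(root, encoding="unicode")
print(edited)
```

Each edit touches a couple of attributes or one node, and the `image` element is never rewritten.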
Justy: I do wonder, though, how brittle the grounding is when the image is messy. Like, a crowded screen or a cluttered photo seems tougher than a clean benchmark.
Cody: I think that’s a fair concern. The paper is strong on the benchmarks they picked, but real-world scenes are uglier. If the grounding is off by even a little, the annotation can feel confident and still be misleading. So I wouldn’t treat it as solved. I’d treat it as a promising interaction layer.
Justy: That’s the part I keep coming back to. The value isn’t just accuracy. It’s whether a normal user can glance at the overlay and say, okay, I get why the model thinks that.
Cody: Yep. And if I were building on this, I’d start with the simplest possible stack. Take an existing VLM, ask it for both answer and regions of interest, then render SVG highlights in the browser. No fancy training. Just see if users trust it more and if they catch mistakes faster.
Justy: For a solo builder, that actually sounds doable over a weekend. Pull a VLM API, make a tiny web app, and let it draw boxes, arrows, and labels on uploaded images.
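Show notes: a sketch of the weekend-project stack, under loud assumptions. The JSON schema below (an `answer` string plus `regions` with a `label` and a normalized `[x, y, w, h]` box) is invented for illustration; it is not the format SketchVLM or any particular VLM API returns. `MOCK_REPLY` stands in for the model call, so you would swap in your own API request and prompt.

```python
import json
from html import escape

# Stand-in for a VLM reply. Hypothetical schema: box values are fractions of
# the image width and height, in the order x, y, w, h.
MOCK_REPLY = json.dumps({
    "answer": "There are 2 mugs on the desk.",
    "regions": [
        {"label": "mug 1", "box": [0.10, 0.50, 0.20, 0.25]},
        {"label": "mug 2", "box": [0.60, 0.55, 0.18, 0.22]},
    ],
})


def reply_to_svg(reply_text, width, height):
    """Parse the reply and return (answer, svg_markup) for an image of the given size."""
    reply = json.loads(reply_text)
    shapes = []
    for region in reply.get("regions", []):
        nx, ny, nw, nh = region["box"]
        # Skip regions whose values fall outside the 0-1 range rather than draw them.
        if not all(0.0 <= v <= 1.0 for v in (nx, ny, nw, nh)):
            continue
        x, y = round(nx * width), round(ny * height)
        w, h = round(nw * width), round(nh * height)
        shapes.append(
            f'<rect x="{x}" y="{y}" width="{w}" height="{h}" '
            'fill="none" stroke="red" stroke-width="3"/>'
        )
        # Escape the label so model text cannot inject markup into the overlay.
        label = escape(str(region.get("label", "")))
        shapes.append(f'<text x="{x}" y="{max(y - 6, 12)}" fill="red">{label}</text>')
    svg = (
        f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {width} {height}">'
        + "".join(shapes)
        + "</svg>"
    )
    return reply["answer"], svg


answer, svg = reply_to_svg(MOCK_REPLY, 640, 480)
print(answer)
print(svg)
```

In a browser you would position this SVG over the uploaded image with the same width and height, then show the answer text next to it so users can compare the two.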
Cody: And if you want to go a little deeper, the paper’s demo and code are live, so you can inspect how they’re structuring the overlay layer and the annotation loop. I’d also test the multi-turn flow with a few ugly images, not just clean examples.
Justy: Yeah, because that’s where the product question shows up. Not can it draw something, but can it help me verify the thing I care about without making me do detective work.
Cody: That’s the real bar.
Justy: Alright, that’s SketchVLM. I’m going to keep thinking about the overlay idea, because honestly, that’s the part that feels most shippable to me.