Justy: Exploring Next, episode 357. Gemini Embedding 2 is interesting because it finally treats the messy stuff people actually have, like screenshots, docs, audio, and video, as one search problem.
Cody: Yeah, and that matters right now because a lot of teams are still duct-taping separate pipelines together. One index for text, one for images, maybe a whole different thing for video, and then the retrieval quality gets weird fast.
Justy: Right. The user story is basically, I need to find the thing in the folder, but the thing might be a slide deck, a screen recording, or a PDF someone exported badly.
Cody: Exactly. Gemini Embedding 2 is the first embedding model in the Gemini API that maps text, images, video, audio, and documents into a single embedding space, and it supports over 100 languages. That single space is the whole trick.
Justy: And they’re not hand-waving the input limits either. It can take up to 8,192 text tokens, six images, 120 seconds of video, 180 seconds of audio, and six pages of PDFs in one call.
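Show notes: a minimal sketch of a pre-flight check against the per-call limits quoted above. The `EmbedRequest` class and `limit_violations` helper are hypothetical names made up for illustration, not part of any SDK, and the numbers are the ones read out in the episode, so confirm them against the current documentation.

```python
from dataclasses import dataclass

# Per-call limits as quoted in the episode; confirm against the current docs.
LIMITS = {
    "text_tokens": 8192,
    "images": 6,
    "video_seconds": 120,
    "audio_seconds": 180,
    "pdf_pages": 6,
}


@dataclass
class EmbedRequest:
    """Hypothetical summary of what one embedding call would contain."""
    text_tokens: int = 0
    images: int = 0
    video_seconds: float = 0.0
    audio_seconds: float = 0.0
    pdf_pages: int = 0


def limit_violations(request: EmbedRequest) -> list[str]:
    """Return one message per limit the request exceeds (empty if it fits)."""
    problems = []
    for name, limit in LIMITS.items():
        value = getattr(request, name)
        if value > limit:
            problems.append(f"{name}: {value} exceeds limit of {limit}")
    return problems
```

Running a check like this before the call makes it easier to decide where to chunk a long PDF or trim a recording instead of finding out from an API error.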
Cody: The more interesting bit is interleaved inputs. So you can send text plus an image together, not just separately, and get one embedding that reflects both. That’s better for real data than pretending modalities live in separate universes.
Justy: That feels like the adoption unlock for product teams. If I’m shipping internal search for a support org or a design team, I don’t want to explain why screenshots are missing from results.
Cody: Yeah, and for agents it gets even more useful. The post calls out agentic retrieval-augmented generation, like scanning hundreds of files to fix a codebase or cross-referencing a pile of PDFs. That’s the kind of task where context gets spread everywhere.
Justy: So the market here is basically anyone with a gross knowledge base. Enterprises, dev tools, maybe media teams too. The barrier is less about wanting it and more about whether you’ve got the data plumbing to feed it cleanly.
Cody: And whether you can evaluate it. Unified embeddings sound neat, but you still need a retrieval benchmark, because if it’s ranking the wrong slide or the wrong clip, the model on top just gets confidently lost.
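Show notes: one way to put a number on "ranking the wrong slide" is recall@k plus reciprocal rank over a small set of labeled queries. This is a generic sketch, not something from the post. The function names and the asset IDs in the usage note are hypothetical.

```python
from typing import Iterable, Sequence


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Iterable[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = relevant & set(ranked_ids[:k])
    return len(hits) / len(relevant)


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Iterable[str]) -> float:
    """1 / position of the first relevant result, or 0.0 if none is returned."""
    relevant = set(relevant_ids)
    for position, item_id in enumerate(ranked_ids, start=1):
        if item_id in relevant:
            return 1.0 / position
    return 0.0
```

For example, if a query should return `slide_1` and `slide_2` and the index returns `["slide_3", "slide_1", "slide_9"]`, recall at k=2 is 0.5 and the reciprocal rank is 0.5. Averaging both over every labeled query gives a score you can track across index versions.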
Justy: Google tries to help with task prefixes too, which I thought was clever. You can shape the embedding for a specific task, like question answering, fact checking, code retrieval, or search results.
Cody: Yeah, that’s a nice asymmetric retrieval move. You prefix the query one way and the document another way, like title plus text for the doc side. It’s a small detail, but it usually matters a lot more than people want it to.
Justy: So the model isn’t just making one blob and hoping for the best. It’s nudging the embedding toward the job you actually want it to do.
Cody: Right. The trade-off is you’re adding a little structure and opinion into the pipeline, which is good if it matches your retrieval goal and annoying if your corpus is chaotic. But honestly, chaos is already there.
Justy: [chuckles] Fair. The other practical thing is they mention the Batch API if you want separate embeddings for individual inputs instead of one aggregated vector.
Cody: That’s useful if you’re doing something like index-time processing where each asset gets its own representation. The post says Agent Platform support for that is coming soon, which tells me Google expects people to use this in pipelines, not just toy demos.
Justy: If I were building with this, I’d start with a tiny mixed corpus. A few PDFs, a couple screenshots, maybe one audio clip, and see if one multimodal index actually beats the Frankenstein version I already have.
Cody: That’s the right weekend project. Use the Gemini API `embed_content` call with a text string and an image byte payload, then compare retrieval quality against a text-only baseline. If you want to get fancier, add the task prefixes on both sides and see if recall improves.
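Show notes: a rough sketch of that weekend project using the `google-genai` Python SDK. Several things here are assumptions: the model ID string, that a single `Content` with a text part and an image part comes back as one aggregated embedding, and that the API key is set in the environment. The helper names are hypothetical. The SDK import is deferred so the ranking helpers can be tried without it installed.

```python
import math
from typing import Sequence

MODEL_ID = "gemini-embedding-2-preview"  # assumption: check the current model list


def embed_text_and_image(text: str, image_bytes: bytes, mime_type: str = "image/png") -> list[float]:
    """Request one embedding for an interleaved text + image input."""
    # Imported lazily so the helpers below work without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client()  # reads the API key from the environment
    response = client.models.embed_content(
        model=MODEL_ID,
        contents=types.Content(
            parts=[
                types.Part(text=text),
                types.Part.from_bytes(data=image_bytes, mime_type=mime_type),
            ]
        ),
    )
    # Assumption: one Content with several parts yields a single vector.
    return list(response.embeddings[0].values)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec: Sequence[float], index: dict[str, Sequence[float]]) -> list[str]:
    """Return asset IDs sorted from most to least similar to the query."""
    return sorted(index, key=lambda key: cosine(query_vec, index[key]), reverse=True)
```

To compare against the text-only baseline, build two `index` dictionaries over the same assets, one from multimodal embeddings and one from text alone, run the same queries through `rank` on each, and score both result lists with the same retrieval metric.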
Justy: And if you’re on a product team, this is probably the first time multimodal RAG feels less like a research demo and more like something you can hand to users without a giant apology.
Cody: Yeah. Not magical, but a lot less glue code. Which, for once, is the good news.
Justy: That’s enough for today. This has been Exploring Next, episode 357. Thanks, Cody. We’ll be back with more stuff that actually helps you ship.