Justy: Every time you open a chatbot and just... keep talking, the thing has to hold all of that somewhere.
Justy: Welcome to Exploring Next, episode 358. Google just published something called TurboQuant and the short version is: six times less memory, same output.
Cody: And the memory we're talking about isn't storage — it's working memory. The KV cache. Think of it like RAM for the conversation itself. Every token the model processes, every partial answer it's building toward, lives there while it's generating a response.
Justy: And it adds up fast, right? Like, it's not just your one little question.
Cody: Scales linearly with users. A single sentence is maybe a few dozen tokens, totally fine. But sophisticated tasks — long documents, multi-turn research sessions — we're talking hundreds of thousands of tokens, which can mean tens of gigabytes per session. Then multiply that by the billions of requests a day a system like ChatGPT handles.
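A rough back-of-the-envelope version of that arithmetic, as a sketch rather than a measurement. The default shape values (32 layers, 8 KV heads, head dimension 128, 16-bit values) are assumptions chosen to resemble an 8B-class model with grouped-query attention; the function name `kv_cache_bytes` is made up for this illustration.

```python
GIB = 1024 ** 3

def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Estimate KV cache size for one session (all shape defaults are assumptions)."""
    # The leading 2 covers one key vector plus one value vector
    # per layer and per KV head, for every token held in the cache.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * bytes_per_token

# A short sentence, a long chat, and a long-document session.
for tokens in (50, 8_000, 128_000):
    size_gib = kv_cache_bytes(tokens) / GIB
    print(f"{tokens:>7} tokens -> {size_gib:.3f} GiB")
```

Under these assumptions each token costs 128 KiB of cache, so a 128,000-token session lands in the mid-teens of GiB for a single user, and larger models cost more per token.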
Justy: Okay so that's the actual cost center. Not the model weights sitting on disk — the live working memory during inference.
Cody: Exactly. And what TurboQuant does is compress that cache in real time. Quantization itself isn't new, but the familiar kind is static: compress once before the model runs, done. Here they're compressing the KV cache while the model is actively generating output, which means the compressed data has to stay accurate while new tokens keep getting added.
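To make the static-versus-live distinction concrete, here is a minimal sketch of a cache that quantizes each vector at the moment it is appended, instead of in one pass up front. It uses plain uniform 4-bit quantization with a per-vector offset and scale, which is a stand-in technique and not TurboQuant's method; the class name `OnlineQuantizedCache` and its methods are hypothetical.

```python
class OnlineQuantizedCache:
    """Toy cache that compresses each vector as it arrives (not TurboQuant itself)."""

    def __init__(self, bits=4):
        self.levels = (1 << bits) - 1   # highest integer code, e.g. 15 for 4 bits
        self.entries = []               # one (offset, scale, codes) tuple per token

    def append(self, vector):
        # Quantize immediately: there is no later pass over the whole cache.
        lo, hi = min(vector), max(vector)
        scale = (hi - lo) / self.levels or 1.0  # fall back to 1.0 for constant vectors
        codes = [round((x - lo) / scale) for x in vector]
        self.entries.append((lo, scale, codes))

    def read(self, index):
        # Dequantize on demand when attention needs the vector again.
        lo, scale, codes = self.entries[index]
        return [lo + c * scale for c in codes]

cache = OnlineQuantizedCache(bits=4)
cache.append([0.12, -0.80, 0.45, 0.03])   # arrives while generation is running
cache.append([1.50, 1.10, -0.20, 0.70])   # next token, compressed on arrival
print(cache.read(0))
print(cache.read(1))
```

Each stored value is off by at most half a quantization step, and nothing ever needs the uncompressed history again, which is the property a live cache has to maintain.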
Justy: So what are the two techniques? I read PolarQuant and something called QJL, which sounds like a government agency.
Cody: [chuckles] Quantized Johnson-Lindenstrauss, yeah, very approachable name. PolarQuant takes the vectors in the KV cache and rotates them from Cartesian coordinates into polar coordinates. When you do that, the angles line up more consistently, so you can represent them with fewer bits. Then QJL nudges the values back toward accuracy to clean up whatever errors the compression introduced. Together they hit that claimed six-times reduction — tested on Llama 3.1-8B, Gemma, Mistra
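The QJL step can be illustrated with a sign-bit Johnson-Lindenstrauss sketch: project a key vector with a random Gaussian matrix, keep only the sign of each projection plus the key's norm, and still recover an unbiased estimate of its dot product with a full-precision query. This is a sketch of the general idea under stated assumptions, not the paper's implementation; the function names, the dimension 16, and the 2,000 projections are all illustrative choices.

```python
import math
import random

def make_projection(m, d, seed=0):
    """Random Gaussian matrix with m rows and d columns."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]

def project(S, v):
    """Matrix-vector product S @ v in plain Python."""
    return [sum(s_j * v_j for s_j, v_j in zip(row, v)) for row in S]

def sign_sketch(S, key):
    """Keep one sign bit per projection, plus the key's Euclidean norm."""
    signs = [1 if p >= 0 else -1 for p in project(S, key)]
    norm = math.sqrt(sum(x * x for x in key))
    return signs, norm

def estimate_dot(S, query, sketch):
    """Estimate <query, key> from the key's sign bits and norm."""
    signs, norm = sketch
    proj_q = project(S, query)  # the query stays at full precision
    total = sum(p * s for p, s in zip(proj_q, signs))
    # sqrt(pi/2) undoes the shrinkage caused by keeping only signs.
    return math.sqrt(math.pi / 2) * norm / len(S) * total

rng = random.Random(42)
key = [rng.uniform(-1, 1) for _ in range(16)]
query = [rng.uniform(-1, 1) for _ in range(16)]
S = make_projection(2000, 16, seed=7)

exact = sum(q * k for q, k in zip(query, key))
approx = estimate_dot(S, query, sign_sketch(S, key))
print(f"exact={exact:.4f} approx={approx:.4f}")
```

The estimate gets tighter as the number of projections grows, so the number of sign bits per vector is the knob that trades memory against accuracy.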
Justy: Okay, walk me through who actually wins here first. My instinct is it's not the end user directly — it's whoever's running the inference.
Cody: That's right, at least initially. Inference providers can either serve way more users on the same hardware, or run a much larger model for the same user count. The sleeper use case I find more interesting is on-device — phones, laptops. If you can fit a capable model in the memory envelope of a phone chip, that changes what's possible without a network call.
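The two options Cody lists come down to simple division over a fixed memory budget. A hedged sketch: the 60 GiB cache budget, the 16 GiB per-session figure, and the 128 KiB per-token cost are illustrative assumptions, and the factor of 6 is the claimed reduction, not a measured one.

```python
GIB = 1024 ** 3

def sessions_that_fit(cache_budget_bytes, bytes_per_session, compression=1):
    """Concurrent sessions that fit in the budget at a given compression factor."""
    return (cache_budget_bytes * compression) // bytes_per_session

def max_context_tokens(cache_budget_bytes, bytes_per_token, compression=1):
    """Longest single-session context that fits in the budget."""
    return (cache_budget_bytes * compression) // bytes_per_token

budget = 60 * GIB        # assumed memory left for KV cache on one server
per_session = 16 * GIB   # assumed cost of one long-context session
per_token = 128 * 1024   # assumed uncompressed cache bytes per token

print("sessions, uncompressed:", sessions_that_fit(budget, per_session))
print("sessions, 6x claimed:  ", sessions_that_fit(budget, per_session, compression=6))
print("context, uncompressed: ", max_context_tokens(16 * GIB, per_token))
print("context, 6x claimed:   ", max_context_tokens(16 * GIB, per_token, compression=6))
```

The same budget can be spent on more users or on longer contexts, and the on-device case is the same calculation with a much smaller budget.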
Justy: That's the one that actually changes the product story. Local inference that doesn't feel compromised.
Cody: Right. Though I want to flag the thing that I think is getting a little lost in the hype framing.
Justy: The DeepSeek comparison. Cloudflare's CEO said it on X and then memory company stocks dropped the same day.
Cody: Which, look, I get why the framing landed — surprise efficiency breakthrough, comparable outputs, fraction of the resource cost. But there's a Merrill Lynch analyst note that's the more grounded take: six-times efficiency gains don't typically become six-times less hardware. They become six-times bigger models or longer context windows. The efficiency gets reinvested. And also — this is still lab-stage. Google presented at ICLR end of April, PolarQuant and QJL at AISTATS in e
Justy: Which is the adoption barrier I keep bumping into. Lab paper to production inference stack is not a short path.
Cody: Months at minimum. Though if it gets picked up by open-source inference frameworks — vLLM, text-generation-inference — that timeline compresses. And TurboQuant is inference-only, which is actually fine — inference is where the ongoing operational cost lives. Training is expensive but you do it once. Inference is every single request, forever.
Justy: [sighs] Okay, I think that's the honest picture. Genuinely clever technique, meaningful if it ships, but the stock-market-panic version is probably overdone.
Cody: Agreed. Though I will say — the coordinate rotation idea in PolarQuant is just elegant. I'd have been annoyed I didn't think of it first.
Justy: [chuckles] Sure, you were this close.
Cody: And when the ICLR session recordings drop, the PolarQuant paper is worth reading directly. The math behind the coordinate rotation is actually pretty accessible — it's one of those papers where the core idea fits in about two pages.
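For a feel of the coordinate change before reading the paper, here is a deliberately simplified sketch. It pairs up coordinates, converts each pair to a radius and an angle, stores the angle as a 4-bit code, and converts back. Keeping the radius at full precision and working on independent 2D pairs are simplifying assumptions made here; the paper's actual scheme may differ, and the function names are hypothetical.

```python
import math

def to_polar_codes(vector, angle_bits=4):
    """Turn consecutive (x, y) pairs into (radius, quantized angle code)."""
    n_bins = 1 << angle_bits
    step = 2 * math.pi / n_bins
    pairs = []
    for x, y in zip(vector[0::2], vector[1::2]):   # assumes an even-length vector
        r = math.hypot(x, y)
        theta = math.atan2(y, x)                   # angle in [-pi, pi]
        code = int((theta + math.pi) / step) % n_bins
        pairs.append((r, code))
    return pairs

def from_polar_codes(pairs, angle_bits=4):
    """Rebuild Cartesian coordinates using the centre of each angle bin."""
    n_bins = 1 << angle_bits
    step = 2 * math.pi / n_bins
    vector = []
    for r, code in pairs:
        theta = -math.pi + (code + 0.5) * step
        vector.extend((r * math.cos(theta), r * math.sin(theta)))
    return vector

original = [0.3, -1.2, 2.0, 0.5, -0.7, -0.1, 0.0, 1.5]
restored = from_polar_codes(to_polar_codes(original))
for a, b in zip(original, restored):
    print(f"{a:+.3f} -> {b:+.3f}")
```

With 16 angle bins, each reconstructed pair sits within half a bin width of the original direction, so the error scales with the radius and shrinks as angle bits are added.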
Justy: So: llama.cpp for solo builders, text-generation-inference if you're running something real, and the paper itself when it's up. That's a solid weekend. Thanks for being in DC, Cody — and for letting me use your coffee maker at six in the morning.