Justy: Okay, this is the one that made me do a double-take this morning.
Cody: Yeah?
Justy: A team from Tencent, HKUST, and Tsinghua just dropped a paper on slashing KV cache memory for long-context LLMs. Like, not by a little. By NINETY percent at half a million tokens.
Cody: Of course you’d lead with the headline.
Justy: I’m serious. This isn’t just ‘we tweaked the attention layers.’ They built this Lookahead Sparse Attention thing with a neural indexer that guesses which parts of the cache you’ll actually need next.
Cody: Right…
Justy: And they trained the indexer separately. No backbone in GPU memory. So the whole ‘we can’t fit this on a single node’ problem just… shrinks.
Cody: Okay, slow down. First off, how do they guess?
Justy: It’s a dual-encoder setup. The indexer learns to predict future context demands and only keeps the query-critical KV chunks loaded.
Cody: Mm-hm.
Justy: So instead of hauling the entire history in VRAM like some kind of digital hoarder, it’s just… pruning what won’t matter. And it’s not a tiny gain, either: 13.5 percent of the baseline KV footprint on LongBench, RULER, all the usual suspects.
Cody: 13.5 percent is insane. But hold on, how do they avoid the indexer just hallucinating what’s important? If it misses, you lose recall.
Justy: That’s the part I don’t fully get yet. But the paper says it actually acts like an attention denoiser. It improves accuracy slightly on average, a 0.6 percent absolute gain.
Cody: Huh. And they’re not even loading the backbone during training?
Justy: Exactly. Backbone-free, decoupled. The indexer trains on its own with standard retrieval frameworks.
Cody: That’s… clean. I like that. No monstrous fine-tune runs.
Justy: I mean, imagine the serving costs. Cody, you’ve bitched about KV cache for a year.
Cody: Yeah, yeah, I have. But let’s not pretend this solves everything.
Justy: Here we go.
Cody: First, decoupled training sounds great until you realize the indexer’s only as good as its training data. If the future context it’s predicting looks nothing like the retrieval corpus…
Justy: Mm.
Cody: …you’re flying blind. And second, 500K tokens is cool, but what’s the latency like? Predicting what to keep adds overhead. They didn’t even bench that.
Justy: Fair. But they ran it on eight H20s with SGLang and the KV overhead just… vanished. And the accuracy didn’t tank.
Cody: I’m not saying it’s snake oil. I’m saying the paper reads like they found a cheat code. And cheat codes usually have a catch.
Justy: Like what?
Cody: Like the project lead just left Tencent and the whole thing’s paused. No code. No checkpoints. Just a ‘hey, email me if you want to collaborate.’
Justy: Oh. That’s… Exploring Next as hell, isn’t it? I mean, the numbers are real. 90 percent overhead suppression at scale. And the method’s elegant: indexer as a dual-encoder, train it like retrieval, plug it in.
Cody: Sure. But until someone else replicates it with a public repo, it’s just a really impressive demo.
Justy: You’re impossible.
Cody: And you’re already shipping it in your head. ‘Justy’s Long Context Utopia, coming soon to a GPU near you.’
Justy: Shut up. I just think if this works, it’s the first real crack in the memory wall. And the fact that they didn’t need to retrain the backbone…
Cody: Yeah, that part’s solid.
Justy: Damn right. Alright, I’m grabbing more coffee. This is going to be a long Tuesday.
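
Note: for anyone following along, here is a rough sketch of the chunk-selection idea Justy describes, where a dual-encoder-style indexer scores KV chunks against the current query and only the top-scoring ones stay loaded. This is not the paper's code (none has been released). The function name `select_chunks`, the `keep_ratio` parameter, and the toy two-dimensional embeddings are all assumptions made up for illustration, and a real indexer would use learned encoders rather than hand-written vectors.

```python
import math

def dot(a, b):
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def select_chunks(query_vec, chunk_vecs, keep_ratio):
    """Hypothetical helper: keep the chunks whose embeddings score highest
    against the query embedding. Returns kept indices in original order."""
    # Always keep at least one chunk, even for a very small keep_ratio.
    n_keep = max(1, math.ceil(len(chunk_vecs) * keep_ratio))
    scores = [dot(query_vec, c) for c in chunk_vecs]
    ranked = sorted(range(len(chunk_vecs)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])

# Toy embeddings (assumed values, not real data).
query = [1.0, 0.0]
chunks = [
    [0.9, 0.1], [0.0, 1.0], [0.2, 0.8], [0.7, 0.3],
    [-0.5, 0.5], [0.1, 0.1], [0.0, 0.2], [0.3, 0.9],
]

kept = select_chunks(query, chunks, keep_ratio=0.25)
fraction = len(kept) / len(chunks)  # share of the KV chunks left in memory
print(kept, fraction)
```

With these toy numbers, two of the eight chunks stay loaded. Cody's concern maps directly onto the scoring step: if the scores rank the wrong chunks highly, the needed context is never loaded and recall suffers.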