Justy: Okay, so I saw this piece about local LLM agents and my first thought was, sure, download some weights, fire up Ollama, you're done. Apparently not.
Cody: Yeah. The author built a scientific agent for single-cell RNA analysis — like, raw biological data in, full pipeline out. And the headline is basically: local hosting is the easy part. The infrastructure is what kills you.
Justy: Right. And they actually tried cloud APIs too, didn't they? Claude, GPT. But those just... hide the problem.
Cody: Exactly. When you use Claude Code or whatever, all the context management, the state, the crash recovery — someone else already built that. You don't see it. The moment you host yourself, it all becomes your problem.
Justy: So what actually broke first?
Cody: Speed. Pure latency. Their agent loop does fifty to eighty tool calls, and every single one carries this fixed baggage — system prompt, tool schemas, the whole conversation history. Thirty-six thousand tokens before the model even gets to the new stuff.
Justy: Thirty-six K, every single call.
Cody: Every single one. Ten to fifteen seconds each. And then eventually it just crashes with a context overflow error and you lose all the in-memory state.
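To put rough numbers on what Cody describes, here is a back-of-the-envelope sketch using only the figures quoted above. It assumes the 36,000-token overhead is constant on every call and that nothing is cached, which is a simplification, since the conversation history grows during a run.

```python
# Figures quoted in the conversation; everything else is derived arithmetic.
TOOL_CALLS = (50, 80)            # tool calls per agent run (low, high)
SECONDS_PER_CALL = (10, 15)      # latency per call in seconds (low, high)
FIXED_TOKENS_PER_CALL = 36_000   # context carried on every call (assumed constant)


def run_time_minutes(calls: int, seconds_per_call: float) -> float:
    """Total wall-clock time for one agent run, in minutes."""
    return calls * seconds_per_call / 60


best_case = run_time_minutes(TOOL_CALLS[0], SECONDS_PER_CALL[0])
worst_case = run_time_minutes(TOOL_CALLS[1], SECONDS_PER_CALL[1])

# Tokens re-read across the whole run if nothing is cached between calls.
reprocessed_low = TOOL_CALLS[0] * FIXED_TOKENS_PER_CALL
reprocessed_high = TOOL_CALLS[1] * FIXED_TOKENS_PER_CALL

print(f"Run time: {best_case:.1f} to {worst_case:.1f} minutes")
print(f"Tokens re-read: {reprocessed_low:,} to {reprocessed_high:,}")
```

Under those assumptions a single run takes roughly 8 to 20 minutes and re-reads between about 1.8 and 2.9 million tokens of mostly repeated context.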
Justy: Oh, that's brutal. So you're waiting minutes for something that should feel instant, and then it dies anyway.
Cody: Yeah. So the article splits into two fixes. First half is making inference faster with vLLM optimizations — prefix caching, chunked prefill, swapping attention backends. Second half is keeping sessions alive through better context management and this structured world state that survives trimming.
Justy: Prefix caching — that's the big one, right? Not re-computing attention for the same token sequence every time?
Cody: Right. The system prompt and tool schemas are identical on every call, so vLLM can cache that KV state and just... not redo the work. It's obvious in retrospect but you have to actually configure it, test it, measure it. They ran everything on A100s and H100s with benchmarks for each change.
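The idea behind prefix caching can be illustrated with a toy model. This is a sketch of the general technique, not vLLM's implementation: the class name `ToyPrefixCache`, the block size, and the token values are all made up for illustration. In vLLM itself the feature is switched on through the `enable_prefix_caching` engine argument, and the real cache stores KV tensors rather than a set of hashes.

```python
from typing import List, Set, Tuple


class ToyPrefixCache:
    """Toy model: remembers which token blocks have been seen, keyed by prefix."""

    def __init__(self, block_size: int = 4) -> None:
        self.block_size = block_size      # hypothetical block size
        self._seen: Set[int] = set()      # hashes of (parent hash, block tokens)

    def process(self, tokens: List[int]) -> Tuple[int, int]:
        """Return (tokens_reused, tokens_computed) for one request."""
        reused = 0
        parent = 0  # chaining the parent hash makes each key depend on the full prefix
        full = len(tokens) - len(tokens) % self.block_size
        for start in range(0, full, self.block_size):
            block = tuple(tokens[start:start + self.block_size])
            key = hash((parent, block))
            if key in self._seen:
                reused += self.block_size   # identical prefix already processed
            else:
                self._seen.add(key)         # first time: "compute" and remember
            parent = key
        # Tokens in a trailing partial block are always computed.
        return reused, len(tokens) - reused


# Stand-in for the system prompt and tool schemas: identical on every call.
shared_prefix = list(range(12))

cache = ToyPrefixCache(block_size=4)
first = cache.process(shared_prefix + [100, 101, 102, 103])
second = cache.process(shared_prefix + [200, 201, 202, 203, 204])

print("first call  (reused, computed):", first)
print("second call (reused, computed):", second)
```

The first call computes everything. The second call reuses the 12 shared prefix tokens and only computes the 5 new ones, which is the effect Cody is describing, just at a far smaller scale than a 36,000-token prefix.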
Justy: And the context survival thing — that's less about speed, more about...
Cody: Not losing your work. The author makes this point that scientific workflows need reproducibility and provenance tracking. Like, which cells were filtered, what clustering resolution produced which result. That can't live in a chat log that might get compacted or lost. You need explicit world state.
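A minimal sketch of what "explicit world state" could look like, assuming a simple JSON record kept outside the chat history. The names `WorldState`, `ProvenanceStep`, and `trim_history`, the dataset filename, and the parameter values are hypothetical and not taken from the article.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class ProvenanceStep:
    tool: str                 # which tool ran
    params: Dict[str, Any]    # the exact parameters it ran with
    summary: str              # short human-readable result


@dataclass
class WorldState:
    dataset: str
    steps: List[ProvenanceStep] = field(default_factory=list)

    def record(self, tool: str, params: Dict[str, Any], summary: str) -> None:
        self.steps.append(ProvenanceStep(tool, params, summary))

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text: str) -> "WorldState":
        raw = json.loads(text)
        steps = [ProvenanceStep(**step) for step in raw["steps"]]
        return cls(dataset=raw["dataset"], steps=steps)


def trim_history(messages: List[str], keep_last: int) -> List[str]:
    """Drop older chat messages; the world state is stored separately."""
    return messages[-keep_last:] if keep_last > 0 else []


# Illustrative values only.
state = WorldState(dataset="example_sample.h5ad")
state.record("filter_cells", {"min_genes": 200}, "low-quality cells removed")
state.record("cluster", {"resolution": 0.8}, "clusters assigned")

messages = [f"turn {i}" for i in range(10)]
trimmed = trim_history(messages, keep_last=2)

# The state survives trimming, and a crash, because it round-trips through JSON.
restored = WorldState.from_json(state.to_json())
print(len(trimmed), "messages kept;", len(restored.steps), "steps still on record")
```

The point of the design is that trimming or compacting the conversation never touches the record of which cells were filtered or which clustering resolution was used, and the same JSON could be written to disk so a restarted session picks up where the last one left off.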
Justy: Which is such a product insight, honestly. Everyone talks about agents replacing workflows but nobody wants to talk about the audit trail.
Cody: Yeah, and the author basically says: skills are just prompts, and prompts can be overridden or ignored. For real science you need structured records that outlive any single session.
Justy: So who should actually care about this? Is this... every agent builder, or just the bioinformatics people?
Cody: I think if you're building any agent that runs for more than a few turns and touches real data, this is your future. The cloud API path works until it doesn't — until you need reproducibility, or cost predictability, or your data can't leave your network. Then you're building this anyway.
Justy: Fair. Though I do wonder whether some of this gets solved by better cloud APIs eventually. Like, if Anthropic just shipped provenance tracking...
Cody: Maybe. But the latency problem doesn't go away if you're remote. And for fifty-plus tool calls, even small network overhead adds up fast.
Justy: True. Okay, the models they mention: Qwen 3.6 27B and Gemma 4 31B. Those are the ones that made local viable?
Cody: Those are the recent open-weight releases they call out. The claim is that open models are getting genuinely useful for structured, tool-driven workloads. Which, if true, changes the economics pretty dramatically.
Justy: That's the part that excites me, honestly. Not the vLLM tuning — sorry, I know you love it — but the idea that the open models crossed some threshold where the infrastructure pain is worth it.
Cody: The infrastructure pain is real though. This is not a 'pip install and you're shipping' situation. The author had to dig into inference server configs, GPU memory management, attention backend selection...
Justy: Yeah, I'm not rushing out to build this. But I'm glad someone wrote it down. Four hundred and thirty-eight episodes, Cody. We finally found the person who actually measured the thing.
Justy: Anyway. If you're building agents for real work — not demos, actual work — this is worth reading. The benchmarks alone. That's my take.
Cody: Agreed. And if nothing else, it's a good reminder that 'local LLM' stops being a weekend project the moment you need it to survive a long session without catching fire.
Justy: Beautiful. Another happy Wednesday in the books.