Izzo: If you're running AI in production at a bank, your prompts can't leave the building.
Izzo: You're listening to Exploring Next, episode two twenty-nine. I'm Izzo, here with Boone, and we're diving into something that sounds like marketing fluff but actually isn't — Langsmart just published the first real p95 benchmarks for on-premises AI gateways.
Boone: And they're throwing down a gauntlet to the whole industry with 'show me the p95' — which honestly, as someone who's debugged production systems at 3am, I love that energy.
Izzo: Right? Because here's why this matters right now — every Fortune 500 is scrambling to deploy AI, but regulated industries like banking and healthcare can't just pipe their data through OpenAI's API.
Boone: They need these AI gateway things that sit in their own data centers, but nobody's publishing real performance numbers. It's all been 'trust us, it's fast' until now.
Izzo: So Langsmart tested their Smartflow platform with a Fortune 200 bank and got some wild results. Boone, break down what they actually measured here.
Boone: They deployed this thing as a Docker container on basically a laptop — 4 vCPUs, 8 gigs of RAM — and got responses down from 2.2 seconds to 220 milliseconds. That's a 10x speedup.
Izzo: Hold on, 2.2 seconds to 220 milliseconds? That's not just faster, that's crossing the line from 'users notice the delay' to 'feels instant.'
Boone: Exactly. And the p95 latency — meaning 95% of requests finish faster than this — was under 300 milliseconds. For context, most enterprise SLAs want sub-500ms, so they're beating that comfortably.
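To make the p95 definition concrete, here is a minimal Python sketch of a nearest-rank percentile calculation. The latency samples are made up for illustration and are not taken from the Langsmart benchmark; the `percentile` helper is a hypothetical name, and the 300 ms target is the figure quoted in the episode.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # 1-based rank of the percentile value in the sorted list
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Illustrative latencies in milliseconds (assumed values, not benchmark data)
latencies_ms = [180, 190, 200, 205, 210, 215, 220, 225, 230, 240,
                245, 250, 255, 260, 265, 270, 280, 290, 295, 310]

p95 = percentile(latencies_ms, 95)
meets_target = p95 < 300  # "under 300 milliseconds" from the episode
print(f"p95 = {p95} ms, under 300 ms: {meets_target}")
```

Nearest-rank is only one of several percentile conventions; monitoring tools often interpolate between samples instead, so reported p95 values can differ slightly for the same data.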
Izzo: But how does semantic caching actually work? Because this isn't just storing exact query matches, right?
Boone: Right, that's the clever bit. Traditional caching is like 'if you ask the exact same question, here's the exact same answer.' Semantic caching is more like 'if you ask a similar question, here's a similar answer that's probably good enough.'
Izzo: So it's doing some kind of similarity matching on the meaning, not just the text string.
Boone: Exactly. They're probably using embeddings to represent the semantic meaning of prompts, then doing nearest-neighbor search to find cached responses that are close enough. At a 0.95 similarity threshold, they got 40-50% hit rates.
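The approach Boone is guessing at can be sketched in a few lines of Python. This is an illustration of the general technique, not Smartflow's actual implementation, which the episode does not describe. The `toy_embed` function is a bag-of-words stand-in for a real embedding model, the linear scan stands in for a proper nearest-neighbor index, and every name here is hypothetical. Only the 0.95 threshold comes from the episode.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in embedding: word counts. A real system would call an
    embedding model and get back a dense vector instead."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

class SemanticCache:
    def __init__(self, embed, threshold=0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (embedding, cached response)

    def put(self, prompt, response):
        self.entries.append((self.embed(prompt), response))

    def get(self, prompt):
        """Return the closest cached response, or None on a miss."""
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for vec, response in self.entries:  # brute-force nearest neighbor
            score = cosine(query, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

cache = SemanticCache(toy_embed, threshold=0.95)
cache.put("what is the current prime rate", "cached answer")

hit = cache.get("What is the current prime rate")   # same words, different case
miss = cache.get("how do I reset my password")      # unrelated prompt
```

On a miss, the gateway would forward the prompt to the model and then `put` the fresh response. With a real embedding model, paraphrases of a question would score close to each other, which is what makes the cache "semantic" rather than exact-match.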
Izzo: That hit rate is actually impressive for semantic matching. But Boone, what's the architecture trade-off here? Why can't these banks just use cloud-hosted gateways?
Boone: Data sovereignty. When you're a bank processing loan applications, you legally cannot send that data to a third-party cloud service. It has to stay in your network perimeter.
Izzo: So the whole value prop is 'we give you cloud-level performance without your data leaving the building.' That's a real product-market fit for regulated industries.
Boone: And the fact that they're doing it on such modest hardware is interesting. 4 vCPUs and 8GB RAM is like... I've got more compute power in my weekend project server.
Izzo: Which makes me wonder about the competitive landscape. Who else is playing in this on-premises AI gateway space?
Boone: That's where their 'show me the p95' challenge gets spicy. They're basically saying all the other vendors are making performance claims without publishing real latency percentiles.
Izzo: Smart move. It's like when Netflix started publishing their chaos engineering results — suddenly everyone else looked less transparent by comparison.
Boone: Right, and p95 latency is what actually matters in production. Average response time means nothing if 5% of your users are waiting forever.
Izzo: From a go-to-market angle, this feels like they're trying to create a new standard for how enterprise AI infrastructure gets evaluated. Make benchmarks a competitive requirement.
Boone: Which is good for buyers but probably terrifying for competitors who've been handwaving their performance claims. Now they have to put up actual numbers.
Izzo: The Docker deployment model is smart too — it's not some custom appliance you have to rack and stack. Just docker run and you're live.
Boone: Although I'm curious about the semantic similarity algorithm. 0.95 threshold sounds high — that's pretty strict matching. I wonder what embedding model they're using under the hood.
Izzo: And how they're handling cache invalidation when models get updated or business logic changes. That's always the hard part with caching systems.
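One common way to handle the invalidation problem Izzo raises is to include the model version in the cache key and give each entry a time-to-live. The Python sketch below shows that idea under stated assumptions: it is not described in the episode or attributed to Smartflow, and `VersionedCache`, its parameters, and the version strings are all hypothetical.

```python
import time

class VersionedCache:
    """Cache whose entries are scoped to a model version and expire
    after ttl_seconds. Bumping the version makes old entries unreachable."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self.store = {}     # (model_version, key) -> (expires_at, value)

    def put(self, model_version, key, value):
        expires_at = self.clock() + self.ttl_seconds
        self.store[(model_version, key)] = (expires_at, value)

    def get(self, model_version, key):
        entry = self.store.get((model_version, key))
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            # Entry is stale: drop it and report a miss
            del self.store[(model_version, key)]
            return None
        return value
```

For example, with a fake clock, an entry stored under one model version is invisible to another and disappears once its TTL passes:

```python
now = [0.0]
cache = VersionedCache(ttl_seconds=60, clock=lambda: now[0])
cache.put("model-v1", "loan-risk-summary", "cached answer")
cache.get("model-v1", "loan-risk-summary")  # hit
cache.get("model-v2", "loan-risk-summary")  # miss after a model update
now[0] = 61.0
cache.get("model-v1", "loan-risk-summary")  # miss, entry expired
```

Changes to business logic could be handled the same way, by folding a policy or prompt-template version into the key.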
Boone: True. Plus, financial workloads are probably more repetitive than general AI usage — lots of similar loan evaluations or risk assessments. That might inflate their hit rates compared to other industries.
Izzo: Fair point. But even if you cut their numbers in half, going from 2+ seconds to under 500ms is still a massive user experience improvement.
Boone: Definitely. And I'm adding 'build a semantic cache for my local LLM setup' to the weekend project list. This has me curious about the implementation.
Izzo: So if you want to dig into this yourself, go check out their full benchmarking methodology at langsmart.ai. They published the actual test setup, not just marketing claims. And if you're fe