Justy: So we just paid like four hundred bucks this quarter for Azure OpenAI to churn through a bunch of engineering drawings that probably could’ve been scraped with regex and a little patience.
Cody: Right. And you’re telling me we could’ve saved seventy-five percent of that cost if we’d just asked, ‘does this doc even need a cloud call?’ before we spun up the endpoint.
Justy: Yeah. That’s the Local-First AI Inference pattern the article lays out — seventy to eighty percent of predictable PDFs route to deterministic local extraction at zero cost, then you only hit the cloud for the weird twenty or thirty percent.
Cody: Mm-hm. And they built a composite scoring function to route reliably — not just text presence, not just a single metric, but four criteria: spatial, anchor, format, and contextual signals.
Cody: Spots the difference between a title block scoring ninety-eight and a revision history scoring sixty-six on the same character string.
Justy: Exactly. And once you wire the tiers, you’ve got a three-tier architecture: local deterministic, cloud AI as a last resort, and a human review lane to catch the rare misses. Bounds the error rate so the cloud-only ‘silent hallucination’ problem and the local-only ‘missed scans’ problem both go away.
Cody: Sure. They ran it on four thousand seven hundred engineering drawings. Cloud-first would’ve cost forty-seven bucks and eaten a hundred minutes of wall time.
Cody: Local-first cut it to ten to fifteen dollars and forty-five minutes, and they kept the human review queue tight enough to stay sane.
Justy: But Cody, who’s actually adopting this? Verticals with boring, repeatable layouts — invoices, regulatory filings, those kinds of documents — they’re the sweet spot.
Cody: Wait—you’re saying teams will refactor their whole pipeline for twenty-cent savings per doc?
Justy: Not for the twenty-cent savings, for the forty-seven-dollar surprise that shows up on next month’s Azure bill when thirty percent of those invoices hallucinate a line item and you have to queue them all for review.
Cody: Fine. But you’re underestimating the effort to tune that composite scorer. Five iterations of prompt hammering just to get from eighty-nine to ninety-eight percent accuracy.
Justy: That’s the point — you only do that work once your corpus is stable enough to justify it. The alternative is shipping a cloud endpoint today and debugging drift six weeks later when the model quietly invents a new line item.
Cody: Task-specific validation sets over vendor benchmarks, they say. They upgraded from GPT-4.1 to GPT-5+ and saw zero lift on their four-hundred-file set.
Justy: Which means if you’re not measuring on your own documents, you’re paying for a shiny model that doesn’t move the needle.
Cody: Okay, but who maintains the human review lane once it’s live? That queue isn’t free.
Justy: It’s cheaper than twice-a-year surprise cloud bills when half your corpus leaks into hallucination retry loops.
Cody: Fair. Still feels like a lot of plumbing for edge cases most teams will never hit.
Justy: Maybe. But if your business runs on predictable layouts fifty weeks a year, the local-first router pays for itself the first time a cloud endpoint silently invents an invoice total.
Cody: So this is basically a refactoring play for orgs that hate surprise Azure bills and love saying ‘I could’ve written a regex for that.’
Justy: More like they love not getting woken up at 2 a.m. because the model hallucinated a missing PO number.
Cody: Yeah. Still sounds like a weekend project someone in DevOps builds once, then forgets to maintain.
Justy: Only if they treat the composite scorer like a fire-and-forget script instead of a product component tied to a stable corpus.
Cody: So the real test is whether you’ll still be running the same router a year from now when the invoice layouts drift.
Justy: That’s why you keep the human review lane: it’s the canary that tells you when the composite starts to stink.
Cody: Alright. Let’s say I buy it. How do we actually poke at this?
Justy: Two ways. One, a solo hack: grab pdfminer.six, wire a tiny composite scorer — local if format match and anchor present, cloud else — and use a local LLM as the fallback edge case.
Cody: Two hundred lines of Python, no cloud budget required.
Justy: Perfect. And two, if you’re on a team: drop a tiny Azure Function in front of your existing pipeline. It logs every local hit and every cloud call, and spits out cost deltas and latency over a week of real docs. Then you decide whether to refactor.
Cody: So the verdict is: if your corpus is stable and layout-predictable, the three-tier Local-First router is worth the upfront legwork.
Justy: Yep. Otherwise, keep hitting the cloud endpoint and budget for surprise retries — at least until the bill hits your desk.
Cody: Good thing we’re not shipping anything mission-critical this week.
Justy: True. Coffee’s cold anyway. Anyway—thread this thing into our backlog before someone spins up another hundred-dollar OpenAI run just to parse a PDF that’s mostly boxes and arrows.