Justy: So this paper is basically saying the annoying part isn’t whether LLMs can pull fields out of filings. It’s which little machine you wrap around them so the thing doesn’t get expensive and flaky the second you point it at real volume.
Cody: Yeah, exactly. The paper’s nice because it stops hand-waving and compares four orchestration patterns on the same job, same corpus, same field set. They’re looking at 10,000 SEC filings, a mix of 10-Ks, 10-Qs, and 8-Ks, with 25 extraction field types, which is a pretty decent stress test for structured financial docs.
Justy: Mm-hm.
Cody: And the whole motivation is familiar if you’ve ever tried to ship this stuff. Single-shot extraction hits context limits, chunking breaks cross-references, and once the output gets complicated, you need some way to verify it instead of just trusting the first answer that came back.
Justy: That’s the part that feels production-real to me. A team can get a demo working on ten filings in a notebook, and then suddenly the quarter-end batch lands and everybody’s asking why the bill jumped or why one compensation table got parsed three different ways.
Cody: Right. The four patterns are pretty intuitive once you hear them. Sequential is the plain pipeline, parallel fan-out sends work to multiple agents and merges, hierarchical puts a supervisor in charge of workers, and reflexive is the self-correcting loop where the system re-checks itself before it settles on an answer.
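As a rough illustration of the four patterns Cody lists, here is a minimal Python sketch. It is not the paper's implementation: `extract` is a hypothetical stub standing in for a model call, and the confidence values, the `plan` mapping, and the threshold are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def extract(chunk, field):
    """Hypothetical stub for one model call; returns a value plus a confidence."""
    text = chunk.strip()
    return {"field": field, "value": text, "confidence": 0.9 if text else 0.2}

def best(results):
    # Merge step (assumed): keep the highest-confidence candidate.
    return max(results, key=lambda r: r["confidence"])

def sequential(chunks, field):
    # Plain pipeline: one chunk after another, then pick an answer.
    return best([extract(c, field) for c in chunks])

def parallel_fan_out(chunks, field, workers=4):
    # Fan out: several workers run at once, results are merged afterwards.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return best(list(pool.map(lambda c: extract(c, field), chunks)))

def hierarchical(sections, plan):
    # Supervisor-worker: `plan` maps each field to the section the
    # supervisor chose for it; each worker handles one assignment.
    return {field: extract(sections[name], field) for field, name in plan.items()}

def reflexive(chunk, field, threshold=0.8, max_rounds=3):
    # Self-correcting loop: re-run until the answer clears the threshold
    # or the round budget is spent. Returns the result and rounds used.
    rounds = 0
    result = None
    while rounds < max_rounds:
        rounds += 1
        result = extract(chunk, field)
        if result["confidence"] >= threshold:
            break
    return result, rounds
```

With a deterministic stub the reflexive loop cannot actually improve its answer between rounds; the sketch only shows where the extra calls, and so the extra cost, come from.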
Justy: Yeah, and I like that they’re not pretending those are just academic names. Those are basically four different ways of spending money and latency to buy confidence.
Cody: Exactly. And the headline result is that reflexive gets the best field-level F1 at 0.943, but it costs about 2.3 times the sequential baseline. Hierarchical lands at 0.921 F1 at about 1.4 times cost, which is why it looks like the best point on the cost-accuracy frontier if you actually care about budgets.
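To make the frontier comparison concrete, the quoted figures can be run through a few lines of arithmetic. This uses only the numbers mentioned in the conversation; the sequential baseline's F1 is not quoted, so the sketch compares reflexive against hierarchical rather than against the baseline.

```python
# Figures as quoted in the discussion; cost is relative to sequential (1.0x).
patterns = {
    "hierarchical": {"f1": 0.921, "cost": 1.4},
    "reflexive": {"f1": 0.943, "cost": 2.3},
}

# What moving from hierarchical to reflexive buys, and what it costs.
delta_f1 = patterns["reflexive"]["f1"] - patterns["hierarchical"]["f1"]
delta_cost = patterns["reflexive"]["cost"] - patterns["hierarchical"]["cost"]
cost_ratio = patterns["reflexive"]["cost"] / patterns["hierarchical"]["cost"]

print(round(delta_f1, 3))    # 0.022 extra F1
print(round(delta_cost, 2))  # 0.9 extra baseline-units of cost
print(round(cost_ratio, 2))  # 1.64 times the hierarchical cost
```

So, on these numbers, the last 0.022 of F1 costs roughly 64% more than the hierarchical setup.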
Justy: Okay, that’s the kind of number that makes a product person sit up. Two points of F1 is not nothing, but if the cheaper setup is already close enough for audit workflows or internal analytics, the reflexive thing starts looking a little like luxury trim.
Cody: Mm-hm. The interesting part is the ablations. They add semantic caching, model routing, and adaptive retries, and the hybrid setup recovers 89% of the reflexive accuracy gains at only 1.15 times baseline cost. That’s the practical nugget for me, because it says a lot of the value is in the control plane, not just the biggest model.
Justy: Wait, that sounds like the boring infrastructure piece is secretly the hero. Which, honestly, rude to the fancy loop, but also believable.
Cody: No, seriously. If a document or field looks low-confidence, you route it differently, cache repeated structure, and only spend the extra tokens when the first pass looks shaky. That’s a much more shippable story than every record getting the full expensive treatment.
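The "only spend extra tokens when the first pass looks shaky" idea can be sketched as a confidence-gated router. Everything here is an assumption for illustration: `cheap_model` and `strong_model` are hypothetical stubs, and the 0.8 threshold and single retry are arbitrary defaults, not values from the paper.

```python
def route(text, field, cheap, strong, threshold=0.8, max_retries=1):
    """Try the cheap path first; escalate only if confidence is low."""
    result = cheap(text, field)
    attempts = 0
    path = "cheap"
    while result["confidence"] < threshold and attempts < max_retries:
        attempts += 1
        result = strong(text, field)  # the expensive call happens only here
        path = "escalated"
    return {**result, "path": path, "extra_calls": attempts}

def cheap_model(text, field):
    # Hypothetical stub: confident only when the field name appears literally.
    found = field.lower() in text.lower()
    return {"value": text if found else None, "confidence": 0.95 if found else 0.4}

def strong_model(text, field):
    # Hypothetical stub for a larger model or a re-check pass.
    return {"value": text, "confidence": 0.9}

easy = route("Total revenue was $10M", "revenue", cheap_model, strong_model)
hard = route("Net sales were $10M", "revenue", cheap_model, strong_model)
print(easy["path"], easy["extra_calls"])  # cheap 0
print(hard["path"], hard["extra_calls"])  # escalated 1
```

The design point is that the expensive call sits behind the threshold check, so the average cost per document depends on how often the first pass comes back unsure.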
Justy: I was in a coffee shop earlier trying to answer emails on bad Wi-Fi and that felt like a tiny version of this. Everything’s fine until one slow retry makes the whole thing feel cursed. Anyway, in production I’d want the system to know when to stop being clever and just take the cheaper path.
Cody: Yeah, and the paper’s scaling analysis is the other half of that. They push from 1K to 100K documents per day and see non-obvious throughput-accuracy degradation curves, which is a fancy way of saying you can’t assume the same architecture behaves nicely once you saturate queues and retry paths.
Justy: Right, that’s the piece that gets missed in a lot of benchmark posts. The model score looks great, and then the ops graph says something totally different once the batch jobs stack up and the latency tails start stretching.
Cody: Exactly. I think the methodology is solid because they’re not only measuring F1. They track document-level accuracy, end-to-end latency, cost per document, and token efficiency. That’s what makes the trade-off surface real instead of just leaderboard theater.
Justy: Do you think the reflexive setup is actually worth it anywhere, or is it mostly a research flex?
Cody: I think it depends on the failure cost. If you’re doing high-stakes extraction where one bad field causes a downstream exception, then the extra cost might be justified. But for a lot of product work, hierarchical plus selective retry is probably the better first ship, because it gets you most of the way there without turning every document into a mini-debate.
Justy: Yeah, that tracks. And I do like that this isn’t some fantasy about replacing the whole pipeline. It feels like an argument for putting the LLM in a system with rules, routing, and a budget, which is way more believable.
Cody: Exactly. If I were building this, I’d start with a hierarchical supervisor-worker pipeline, add semantic caching early, and only turn on reflexive retries for fields with low confidence or known ambiguity. For a solo builder, you could do the same thing on a small SEC filing set with an open-weight model, a simple router, and a cache keyed by document section and field type.
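For the solo-builder version, a cache keyed by document section and field type might look like the sketch below. Note the simplification: this is an exact-match cache over normalized section text, not the semantic (similarity-based) caching the paper ablates. The class name, the normalization rule, and the `compute` callback are all hypothetical.

```python
import hashlib

class SectionFieldCache:
    """Exact-match cache keyed by (hash of normalized section text, field type)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(section_text, field_type):
        # Assumed normalization: lowercase and collapse whitespace, so that
        # trivially different copies of the same boilerplate share a key.
        normalized = " ".join(section_text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        return (digest, field_type)

    def get_or_compute(self, section_text, field_type, compute):
        key = self._key(section_text, field_type)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = compute(section_text, field_type)  # e.g. a model call
        self._store[key] = value
        return value
```

A usage pattern would be to wrap each worker call in `get_or_compute`, then watch the `hits` and `misses` counters on a small filing set to see whether repeated sections are common enough for the cache to pay for itself.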
Justy: That’s a nice weekend project, honestly. Grab a handful of filings, wire up one baseline parser, one supervisor, and one retry path, and see where the cost curve breaks before you ever think about fancy scale.
Cody: Mm-hm. And I’d keep an eye on throughput as soon as you simulate batch growth, because the paper’s main warning is that scaling changes the shape of the problem. The architecture that looks best at 1,000 docs a day may not be the one you want at 100,000.
Justy: So the takeaway is kind of boring in the best way: build the least dramatic system that still verifies itself when it matters, then prove it doesn’t fall apart when the queue gets ugly.
Cody: Yeah, that’s the one. The reflexive loop is the flashy one, but the hierarchical setup with routing and retries is probably where a real team starts if they want something they can defend in production.
Justy: Alright, that’s enough filing drama for one drive. I’m still sticking with the cheaper path first, Cody, but I’ll admit the self-correcting loop has a little swagger.