Justy: Anyway — this LangGraph fault-tolerance post is perfect timing because our last agent run that took, like, twelve hours just crapped out at step eight and we had to start over.
Cody: Right… because that’s the argument: throw in some RetryPolicy and TimeoutPolicy and your month-long agent run never dies.
Justy: Yeah, exactly — no more cold starts at three a.m. when an LLM 500s mid-run.
Cody: Sure, but their default retry_on list is ConnectionError, HTTP 5xx responses, and the occasional TimeoutError.
Justy: Which is like… the exact set of things that happen when a prod agent is live.
Cody: Mm-hm… if you squint past ValueError, TypeError, or the time your dev forgot to pass the API key and every node throws a RuntimeError.
Justy: Okay okay — so you tweak retry_on and move on.
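Justy's "tweak retry_on" step might look something like the sketch below: a predicate that retries only errors a later attempt could plausibly fix. The `TRANSIENT` and `PERMANENT` tuples and the `should_retry` name are assumptions for illustration, not LangGraph's defaults, and whether `retry_on` accepts a callable like this should be checked against the RetryPolicy reference for your version.

```python
# Hypothetical classification; adjust both tuples to the failures you actually see.
TRANSIENT = (ConnectionError, TimeoutError)
PERMANENT = (ValueError, TypeError, RuntimeError)


def should_retry(exc: BaseException) -> bool:
    """Return True only for errors that a later attempt could plausibly fix."""
    if isinstance(exc, PERMANENT):
        # Bad input or a missing API key will fail the same way every time.
        return False
    # Anything not explicitly listed as transient is treated as non-retryable.
    return isinstance(exc, TRANSIENT)
```

Defaulting unknown exceptions to "do not retry" is a design choice here: it fails fast on bugs at the cost of occasionally giving up on a transient error you have not listed yet.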
Cody: Wait, their RetryPolicy also includes exponential backoff and jitter, but the article doesn’t quantify how often retries pile up under load.
Justy: True, but the point is you don’t ship twenty-five lines of retry boilerplate per node anymore.
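For a rough sense of what backoff with jitter costs per node, here is a minimal sketch of the delay schedule. The parameter names (`initial`, `factor`, `max_interval`) and the additive jitter of up to one second are assumptions for illustration, not the library's actual implementation.

```python
import random


def backoff_delays(max_attempts=5, initial=0.5, factor=2.0,
                   max_interval=30.0, jitter=True, rng=None):
    """Return the waits (in seconds) between consecutive attempts."""
    rng = rng or random.Random()
    delays = []
    # N attempts means N - 1 waits between them.
    for attempt in range(max_attempts - 1):
        # Exponential growth, capped so late retries do not wait forever.
        delay = min(initial * factor ** attempt, max_interval)
        if jitter:
            # Assumed jitter model: add up to one second of random spread
            # so many failing nodes do not all retry at the same instant.
            delay += rng.uniform(0.0, 1.0)
        delays.append(delay)
    return delays


print(backoff_delays(jitter=False))  # [0.5, 1.0, 2.0, 4.0]
```

With jitter off, five attempts add 7.5 seconds of waiting to a failing node, which is the kind of number worth multiplying by your node count before trusting the defaults under load.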
Cody: Fair. What about timeouts?
Justy: They’ve got run_timeout for the whole node attempt and idle_timeout for the gap between steps.
Cody: Sure, but if your LLM returns tokens in bursts, a thirty-second run_timeout might cut off a token stream that’s actually still valid.
Justy: So set it higher or use idle_timeout?
Cody: Their docs imply run_timeout is wall-clock, which is fine until someone expects it to track token generation time instead.
Justy: I see… so the post sells the primitives but doesn’t tell you how to tune them.
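The run_timeout versus idle_timeout distinction from this exchange can be sketched without any library at all. The code below replays a bursty stream as (timestamp, token) pairs and applies both limits. The `consume` function, the two exception classes, and the `BURSTY` sample data are hypothetical, and the semantics (wall-clock budget per attempt, maximum gap between events) are an assumption based on how the hosts describe the two settings.

```python
class RunTimeout(Exception):
    """The whole attempt exceeded its wall-clock budget."""


class IdleTimeout(Exception):
    """Too much time passed between two consecutive events."""


def consume(events, run_timeout, idle_timeout):
    """Collect tokens from (seconds_since_start, token) pairs under both limits."""
    tokens = []
    last_seen = 0.0
    for timestamp, token in events:
        if timestamp - last_seen > idle_timeout:
            raise IdleTimeout(f"no output for {timestamp - last_seen:.0f}s")
        if timestamp > run_timeout:
            raise RunTimeout(f"attempt still running at {timestamp:.0f}s")
        tokens.append(token)
        last_seen = timestamp
    return tokens


# Made-up bursty stream: two quick tokens, an 18 s pause, two more, a 14 s pause.
BURSTY = [(1.0, "a"), (2.0, "b"), (20.0, "c"), (21.0, "d"), (35.0, "e")]
```

On this sample, a 30-second run_timeout cuts the stream off at the last token even though no single pause exceeds 18 seconds, while a 60-second run_timeout with a 25-second idle_timeout lets the whole stream through.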
Cody: And the error_handler section is just a second node that gets the failure context—sounds elegant until your handler throws and now you’re in a loop.
Justy: Okay, but in practice you still save the run instead of aborting mid-stream, which is more than most tools give you.
Cody: True. Fine, the design keeps the happy path short.
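Cody's worry about a handler that throws can be guarded against by making sure the handler's own failure is never routed back into the handler. This is a minimal sketch in plain Python, not LangGraph's error_handler mechanism; `run_with_handler` and the handler signature `(state, error)` are hypothetical.

```python
def run_with_handler(node, handler, state):
    """Run a node; on failure, give the handler exactly one chance to recover."""
    try:
        return node(state)
    except Exception as node_error:
        try:
            return handler(state, node_error)
        except Exception as handler_error:
            # Surface the handler failure instead of re-entering the handler,
            # which is what would create the loop.
            raise RuntimeError("error handler failed") from handler_error


def flaky_node(state):
    raise ConnectionError("LLM returned 500")


def save_progress(state, error):
    # Hypothetical recovery: record the failure and keep the state so far.
    return {**state, "status": "recovered", "last_error": str(error)}


print(run_with_handler(flaky_node, save_progress, {"step": 8}))
```

Testing this path in staging mostly means feeding it a handler that raises and checking that the run stops with one clear error rather than spinning.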
Justy: So… we care?
Cody: If you’re running hours-long agent graphs today, yes. If you’re still prototyping, add the primitives later.
Justy: Right — which is exactly the order most teams do it in anyway.
Cody: That tracks.
Justy: Anyway, for Build Next: wrap your node in TimeoutPolicy, with run_timeout set to max_tokens divided by your model’s token rate in tokens per second and idle_timeout longer than your longest tool call. Then narrow retry_on to your actual transient set.
Cody: And test the error_handler in staging before you ship it to prod.
Justy: Obvious, but apparently not obvious enough.
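Justy's Build Next rule of thumb works out to a couple of lines of arithmetic. The sketch below is one way to compute it. The `suggest_timeouts` helper and the `headroom` multiplier are assumptions added for illustration (the hosts only say "max_tokens divided by token rate" and "longer than your longest tool call"), and the sample numbers are made up.

```python
def suggest_timeouts(max_tokens, tokens_per_second, longest_tool_call_s,
                     headroom=1.5):
    """Return (run_timeout, idle_timeout) in seconds from rough measurements."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    # Time to generate a full-length response, padded for slow periods.
    run_timeout = max_tokens / tokens_per_second * headroom
    # Must exceed the longest tool call, or a healthy call looks idle.
    idle_timeout = longest_tool_call_s * headroom
    return run_timeout, idle_timeout


# Example: 4000-token cap, 50 tokens/s, slowest tool call takes 20 s.
print(suggest_timeouts(4000, 50, 20))  # (120.0, 30.0)
```

Measure the token rate and tool latencies from real runs rather than vendor figures, since both inputs drift with load.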
Cody: Yeah. See you next week.