Justy: Anyway, this LangGraph fault-tolerance post is perfect timing because our last agent run that took, like, twelve hours just crapped out at step eight and we had to start over.
Cody: Right… because that’s the argument: throw in some RetryPolicy and TimeoutPolicy and your month-long agent run never dies.
Justy: Yeah, exactly, no more cold starts at three a.m. when an LLM 500s mid-run.
Cody: Sure, but their default retry_on list is ConnectionError, 5xx, and the occasional TimeoutError.
Justy: Which is like… the exact set of things that happen when a prod agent is live.
Cody: Mm-hm… if you squint past ValueError, TypeError, or the time your dev forgot to pass the API key and every node throws a RuntimeError.
Justy: Okay okay, so you tweak retry_on and move on.
Cody: Wait, their RetryPolicy also includes exponential backoff and jitter, but the article doesn’t show a percent chance of retry fatigue under load.
Justy: True, but the point is you don’t ship twenty-five lines of retry boilerplate per node anymore.
Cody: Fair. What about timeouts?
Justy: They’ve got run_timeout for the whole node attempt and idle_timeout for the gap between steps.
Cody: Sure, but if your LLM returns tokens in bursts, a thirty-second run_timeout might cut off a token stream that’s actually still valid.
Justy: So set it higher or use idle_timeout?
Cody: Their docs imply run_timeout is wall-clock, which is fine until someone expects it to match token generation.
Justy: I see… so the post sells the primitives but doesn’t tell you how to tune them.
Cody: And the error_handler section is just a second node that gets the failure context; sounds elegant until your handler throws and now you’re in a loop.
Justy: Okay, but in practice you still save the run instead of aborting mid-stream, which is more than most tools give you.
Cody: True. Fine, the design keeps the happy path short.
Justy: So… we care?
Cody: If you’re running hours-long agent graphs today, yes.
If you’re still prototyping, add the primitives later.
Justy: Right, which is exactly the order most teams do it in anyway.
Cody: That tracks.
Justy: Anyway, for Build Next: wrap your node in TimeoutPolicy with run_timeout set to max_tokens divided by your model’s token rate, and idle_timeout longer than your longest tool call; then tweak retry_on to your actual transient set.
Cody: And test the error_handler in staging before you ship it to prod.
Justy: Obvious, but apparently not obvious enough.
Cody: Yeah. See you next week.
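
Show-notes sketch for the Build Next tip: the timeout arithmetic Justy describes, plus a plain-Python version of retrying only a transient set with exponential backoff and jitter. It deliberately does not call LangGraph. The function names, the 1.5x headroom, the 5-second margin, and the contents of TRANSIENT_ERRORS are illustrative assumptions, not the library's API or its defaults.

```python
import random
import time

# Assumed transient set for illustration; replace with the errors your stack really raises.
TRANSIENT_ERRORS = (ConnectionError, TimeoutError)


def run_timeout_seconds(max_tokens, tokens_per_second, headroom=1.5):
    """Wall-clock budget: max_tokens / token rate, times an assumed safety factor."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return max_tokens / tokens_per_second * headroom


def idle_timeout_seconds(longest_tool_call_s, margin_s=5.0):
    """Idle budget: longer than the longest tool call, by an assumed margin."""
    return longest_tool_call_s + margin_s


def call_with_retry(fn, max_attempts=3, initial_interval=0.5,
                    backoff_factor=2.0, max_interval=8.0, sleep=time.sleep):
    """Retry fn only on TRANSIENT_ERRORS, with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TRANSIENT_ERRORS:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last transient error
            delay = min(initial_interval * backoff_factor ** (attempt - 1), max_interval)
            sleep(delay + random.uniform(0, delay))  # jitter spreads out retries
        # Anything else (ValueError, TypeError, RuntimeError) propagates immediately.
```

With these assumptions, a 4096-token cap at 64 tokens per second gives a 96-second run budget, and a 20-second tool call gives a 25-second idle budget. Cody's point about bursty streams is the reason for the headroom factor: a budget equal to the bare quotient leaves no slack for a slow stretch.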