Justy: This is Exploring Next, episode 344. Definity wants to catch pipeline failures while the job is still running, which matters a lot more now that busted data can quietly break AI features.
Cody: My skeptical read, Justy, is that the pitch is strong because the pain is real, but the claims only matter if teams actually let this thing act in production. Monitoring after the fact is obviously too late sometimes. The harder question is whether an embedded agent can intervene without becoming one more risky moving part.
Justy: Yeah, but that's also why this lands right now. Real people see the symptom as "why is this assistant wrong today?" or "why did the workflow miss something obvious?" Underneath that, it can just be stale data showing up on dashboards that look like they refreshed on time.
Cody: What they're doing is pretty specific. They install a JVM agent inside Spark and say it can also sit in the dbt driver path, so instead of scraping logs after the run, it sees memory pressure, skew, shuffle patterns, and utilization during execution.
Justy: And the dynamic lineage claim matters because inferring relationships without a perfect catalog removes a real blocker in big companies.
Cody: The clever part is placement. If an upstream job got preempted and the input table is stale, they can stop the downstream run before bad data propagates. That's meaningfully different from tools that mostly tell you what already happened.
Justy: The buyer doesn't need full robot control on day one. Even a strong early warning system with context already assembled is valuable for a data platform lead supporting internal products or AI features.
Cody: I get cautious when it crosses from observation into action. Modifying resources mid-run or stopping jobs is powerful, but each control can also create a new failure if the policy is wrong.
Justy: So the barrier feels less like install friction and more like trust. Teams will likely start with read-only or recommend-only mode.
Cody: The strongest use case may be on-prem Spark, where inefficiency directly costs hardware capacity because you can't just burst into more cloud nodes. There, seeing skew and waste during the run is money.
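One rough way to put a number on the skew Cody mentions is the ratio of the slowest task to the median task within a stage. The sketch below works from a finished Spark event log, so it is a post-run approximation and not what an in-process agent would see. It assumes the JSON-lines event log layout with `SparkListenerTaskEnd` events carrying `Stage ID`, `Launch Time`, and `Finish Time`; check those field names against your Spark version. The function name `stage_skew` is made up for illustration.

```python
import json
from collections import defaultdict
from statistics import median


def stage_skew(event_log_lines):
    """Return {stage_id: slowest_task_ms / median_task_ms} for each stage.

    Assumes each line is one JSON event, as in Spark's event log files.
    A ratio near 1.0 means tasks were balanced; a large ratio hints at skew.
    """
    durations = defaultdict(list)
    for line in event_log_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("Event") != "SparkListenerTaskEnd":
            continue
        info = event["Task Info"]
        durations[event["Stage ID"]].append(
            info["Finish Time"] - info["Launch Time"]
        )

    skew = {}
    for stage_id, times in durations.items():
        # Treat a sub-millisecond median as 1 ms to avoid dividing by zero.
        skew[stage_id] = max(times) / max(median(times), 1)
    return skew
```

Feeding it the lines of a local event log file gives one ratio per stage, which is enough to spot the stage where a single straggler task held everything up.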
Justy: And that ROI story is believable for lean enterprise teams, even if you discount vendor numbers.
Cody: My only real caution is scope creep. Detection and prevention can be real time, but diagnosis still has a human in the loop, which honestly makes the product sound more credible to me.
Justy: Mine is a little more commercial than yours. I think this category has room because reactive observability feels old when downstream systems are making decisions continuously. But yeah, the rollout path matters. If they force autonomy too early, buyers will stall.
Cody: For what to build next, I'd try two things. One, spin up a local Spark job and inject failure conditions you can actually measure. Use Apache Spark with event logging turned on, then compare what you can infer from logs after the run versus what you'd want during the run. Two, for a solo builder, make a tiny guardrail service that checks upstream table freshness before launching a dbt or Spark step. Even a simple Python wrapper around job submission teaches the core idea.
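A minimal sketch of that second idea, the wrapper around job submission, might look like the following. It is only an illustration under stated assumptions: the caller supplies the upstream table's last load time (for example from a `MAX(loaded_at)` query, which is not shown), and the names `is_fresh` and `run_if_fresh` are hypothetical. It has no connection to how Definity does this.

```python
import subprocess
from datetime import datetime, timedelta, timezone


def is_fresh(last_loaded_at, max_age, now=None):
    """True if the upstream table was loaded within max_age of now."""
    now = now or datetime.now(timezone.utc)
    return now - last_loaded_at <= max_age


def run_if_fresh(command, last_loaded_at, max_age, now=None,
                 runner=subprocess.run):
    """Launch command only when the upstream data is fresh enough.

    Returns True if the job was launched, False if it was blocked.
    `runner` is injectable so the gate can be tested without a real job.
    """
    now = now or datetime.now(timezone.utc)
    if not is_fresh(last_loaded_at, max_age, now):
        age = now - last_loaded_at
        print(f"Blocked: upstream is {age} old, limit is {max_age}")
        return False
    runner(command, check=True)
    return True
```

In use, `command` would be something like `["spark-submit", "job.py"]`, where `job.py` stands in for your own script. Blocking the launch is the crude version of stopping a downstream run before stale data propagates.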
Justy: And if you want a weekend version, grab dbt Core plus a sample warehouse, add source freshness checks, and gate a downstream model on that result. It's nowhere near Definity, but you'll feel the user story immediately. Cody, I think that's the episode. We're in your kitchen, I'm under-caffeinated, and we'll catch you next time. [laughs]
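For the weekend version, one way to gate a downstream model on the freshness result is to run `dbt source freshness`, read the `target/sources.json` artifact it writes, and only then call `dbt run`. The sketch below assumes each entry in the artifact's `results` list has a `unique_id` and a `status` such as `pass`, `warn`, `error`, or `runtime error`; verify that against your dbt version. The helper names and the choice to block only on `error` and `runtime error` are illustrative assumptions.

```python
import json
import subprocess
from pathlib import Path


def stale_sources(sources_artifact, blocking=("error", "runtime error")):
    """Return unique_ids of sources whose freshness status should block a run."""
    return [
        result.get("unique_id", "<unknown>")
        for result in sources_artifact.get("results", [])
        if result.get("status") in blocking
    ]


def gate_and_run(project_dir, select):
    """Check source freshness, then build the selected model only if it passes."""
    # check=False: dbt exits non-zero when a source is stale, and we want
    # to inspect the artifact ourselves instead of raising here.
    subprocess.run(["dbt", "source", "freshness"], cwd=project_dir, check=False)

    artifact_path = Path(project_dir) / "target" / "sources.json"
    artifact = json.loads(artifact_path.read_text())

    stale = stale_sources(artifact)
    if stale:
        print("Skipping run, stale sources:", ", ".join(stale))
        return False

    subprocess.run(["dbt", "run", "--select", select], cwd=project_dir, check=True)
    return True
```

Keeping the decision in `stale_sources` separate from the subprocess calls makes the policy easy to change, for instance treating `warn` as blocking too.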