Izzo: Your AI agent just got stuck trying to create a ticket in ServiceNow. Again.
Izzo: You're listening to Exploring Next, episode two-ten. I'm Izzo, and with me is Boone. Today we're diving into Guidde—a startup that just raised fifty million to train AI agents on video recordings of human experts instead of dusty PDFs.
Boone: And this is actually brilliant timing, Izzo. We're hitting this weird inflection point where companies are deploying AI agents but they keep failing at the last mile.
Izzo: Right, like you spend a million dollars on some fancy AI tool, but nobody knows how to use it because all you got was a thirty-minute training session.
Boone: Exactly. And the agents themselves are even worse—they hallucinate the moment they hit your actual enterprise software because GPT-4 was never trained on your specific Salesforce setup.
Izzo: So Guidde's approach is fascinating. Instead of feeding agents static documentation, they capture what they call 'Video Ground Truth'—basically recording human experts as they navigate complex software.
Boone: But here's the clever part—they're not just recording pixels. They're capturing every click, scroll, DOM change, even the subtle pauses when a system lags.
Izzo: Boone, break that down for me. What's the technical difference between a screen recording and what Guidde's doing?
Boone: Think of it like this—a normal screen recording is like filming a driver. Guidde is capturing the steering wheel angle, brake pressure, and GPS coordinates. They're synchronizing video frames with underlying HTML metadata.
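A minimal sketch of what "synchronizing video frames with underlying HTML metadata" could look like, assuming each frame and each UI event carries a millisecond timestamp from the same clock. The `UiEvent` fields and the `nearest_frame` and `align` helpers are hypothetical names for illustration, not Guidde's actual schema or API.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass
class UiEvent:
    """One captured interaction (hypothetical schema)."""
    t_ms: int       # capture timestamp in milliseconds
    kind: str       # e.g. "click", "scroll", "dom_mutation"
    selector: str   # CSS selector of the element involved


def nearest_frame(frame_times_ms: list[int], t_ms: int) -> int:
    """Return the index of the frame whose timestamp is closest to t_ms.

    frame_times_ms must be sorted in ascending order and non-empty.
    Ties go to the earlier frame.
    """
    i = bisect_left(frame_times_ms, t_ms)
    if i == 0:
        return 0
    if i == len(frame_times_ms):
        return len(frame_times_ms) - 1
    before, after = frame_times_ms[i - 1], frame_times_ms[i]
    return i if (after - t_ms) < (t_ms - before) else i - 1


def align(frame_times_ms: list[int], events: list[UiEvent]) -> list[tuple[int, UiEvent]]:
    """Pair every event with the index of its nearest video frame."""
    return [(nearest_frame(frame_times_ms, e.t_ms), e) for e in events]
```

With frames at 0, 33, 66 and 100 ms, a click captured at 40 ms lands on the frame at index 1, so the pixels and the DOM event for that moment can be stored together.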
Izzo: So they're building what they call a 'digital world model' of enterprise software. Each company gets their own unique map.
Boone: Right, and that's their moat. Because every enterprise uses a different mix of apps and custom configurations. You can't just train on generic Salesforce—you need the telemetry from how this specific company actually uses it.
Izzo: The architecture is interesting too. They're using a fleet of models instead of relying on one. Gemini for visual tasks, Claude for narrative scripts.
Boone: Smart move. And they've got feedback loops: when users edit a generated video, those corrections go back into training so the models don't repeat the same mistakes. It's like having models fact-check each other.
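A rough sketch of the "fleet of models" idea with a feedback log, under stated assumptions: the two handler functions are stand-ins that only tag their input, and no real model API is called. The task names, `ROUTES`, `run`, and `record_edit` are all hypothetical.

```python
from typing import Callable


def visual_model(payload: str) -> str:
    """Stand-in for a vision-capable model call (assumption: no real API here)."""
    return f"[visual] {payload}"


def narrative_model(payload: str) -> str:
    """Stand-in for a script-writing model call (assumption: no real API here)."""
    return f"[narrative] {payload}"


# Map each task type to the model best suited for it.
ROUTES: dict[str, Callable[[str], str]] = {
    "describe_frame": visual_model,
    "write_script": narrative_model,
}

# Human corrections collected for later retraining or evaluation.
corrections: list[dict[str, str]] = []


def run(task_type: str, payload: str) -> str:
    """Dispatch a task to the model registered for its type."""
    if task_type not in ROUTES:
        raise ValueError(f"no model registered for task type {task_type!r}")
    return ROUTES[task_type](payload)


def record_edit(task_type: str, model_output: str, human_edit: str) -> None:
    """Log a correction only when the human actually changed the output."""
    if model_output != human_edit:
        corrections.append(
            {"task_type": task_type, "model_output": model_output, "human_edit": human_edit}
        )
```

Keeping the routing table separate from the handlers means a model can be swapped for one task type without touching the others, and the `corrections` list is the raw material for the feedback loop described above.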
Izzo: From a product standpoint, I love that they're solving two problems at once. The same videos that train humans also train the AI agents.
Boone: It's like building Waymo for computer usage. The human demonstrations become the training data for autonomous navigation of enterprise UIs.
Izzo: And the numbers are solid—forty-one percent reduction in video creation time, thirty-four percent fewer support tickets. That's real impact.
Boone: Plus they've automated the whole production pipeline. Used to take weeks to create one training video with multiple teams. Now it's seconds.
Izzo: The Magic Redaction feature is clever too—automatically obscuring sensitive data during capture so it stays HIPAA-compliant.
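To make the redaction idea concrete, here is a small text-level sketch. It assumes the sensitive values show up in captured text such as DOM content, and it only covers two illustrative patterns. It is not how Magic Redaction works internally, it does not mask pixels in the video itself, and pattern matching alone does not make a pipeline HIPAA-compliant. `PATTERNS` and `redact` are hypothetical names.

```python
import re

# Illustrative patterns only; a real deployment would need a much broader set.
PATTERNS: dict[str, re.Pattern[str]] = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace every match of each pattern with a labeled placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label.upper()} REDACTED]", text)
    return text
```

Running the redaction at capture time, before anything is stored, is the design choice that matters here: data that is never written down cannot leak from the recording later.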
Boone: Yeah, and they're replacing something like six different tools, Loom, Adobe Premiere, ElevenLabs, and Synthesia among them, with one AI-native platform.
Izzo: What I find compelling is the timing. We're right at this moment where agentic AI is becoming real, but the knowledge infrastructure is completely broken.
Boone: Totally. Foundation models are great at general reasoning but terrible at 'click the third button from the left in your custom SAP interface.'
Izzo: And with forty-five hundred enterprise customers already, they're building a dataset that's really hard to replicate.
Boone: The Vision-Language-Action training sets they're creating—that's the secret sauce. Most companies are still thinking in terms of text documentation.
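For a sense of what a Vision-Language-Action training record might contain, here is a hedged sketch: one frame reference, one natural-language instruction, and one structured action, serialized as JSON Lines. The `VlaSample` fields and the `to_jsonl` helper are assumptions for illustration, not a published Guidde format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class VlaSample:
    """One (vision, language, action) training record (hypothetical schema)."""
    frame_path: str    # vision: path to the captured frame image
    instruction: str   # language: what the human expert was trying to do
    action: dict       # action: the structured UI step that was taken


def to_jsonl(samples: list[VlaSample]) -> str:
    """Serialize samples as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(s), sort_keys=True) for s in samples)
```

The contrast with text documentation is the `action` field: instead of a sentence saying "click Submit", the record holds the exact element and event, which is what an agent needs to reproduce the step.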
Izzo: I'm giving this approach a solid A-minus. The execution looks strong, the timing is perfect, and they're solving a real pain point.
Boone: Only thing I'd want to see is how well the agents actually perform in production versus demos. But the technical foundation is sound.
Izzo: So what should people go build with this? First, check out Guidde's free tier—you can capture up to twenty-five videos and see how the telemetry capture works.
Boone: Second, if you're building agents, look into Vision-Language-Action frameworks. There's some interesting work happening in the VLA space that connects to this.
Izzo: And third, experiment with multimodal model orchestration. The idea of using different models for different tasks instead of one giant model is worth exploring. Adding that to the weekend project list. Though at this point, I might need a separate spreadsheet just for the backlog.
Izzo: Next time your AI agent gets confused navigating enterprise software, remember: it probably just needs better training data. We'll be back in two weeks.