Izzo: Your AI demo agent works perfectly until it hits production and silently fails a customer refund. You're listening to Exploring Next, episode two-thirty-one. I'm Izzo, and today Boone and I are diving into why evaluating AI agents is completely different from testing regular software.
Boone: Right, and this InfoQ piece from Amit Kumar Padhy really nails the core problem: we're still using single-turn accuracy metrics on systems that plan, call APIs, and maintain state across multiple interactions.
Izzo: Exactly. So Boone, what's actually breaking when these agents hit real workflows?
Boone: The failure modes are fascinating. Picture this: the agent correctly identifies a shipping exception in step one, the refund API throws an unexpected error in step two, and the agent silently skips the refund and marks the case resolved. No BLEU score would catch that.
Izzo: That's terrifying from a product perspective. We're basically flying blind on the parts that actually matter to users.
Boone: Exactly. And it gets worse, because these agents are composite systems. They're planning actions, invoking tools, maintaining memory. Classical NLP metrics like ROUGE weren't designed for dynamic behavior.
Izzo: Okay, so what does proper agent evaluation actually look like? Break this down for me.
Boone: The article outlines five pillars: intelligence, performance, reliability, responsibility, and user experience. But the key insight is hybrid evaluation. You need both automated scoring and human judgment.
Izzo: I'm giving hybrid evaluation an A-plus already. Automation gives you scale and repeatability, humans catch the stuff that matters but can't be measured.
Boone: Right. For automation, they're using LLM-as-a-judge patterns, trace analysis, load testing. The human side covers tone, trust, contextual appropriateness: basically everything that makes users actually want to use the thing.
Izzo: And I'm betting operational constraints are huge here. Latency, cost per task, token efficiency?
Boone: Absolutely. The article calls these first-class evaluation targets, not afterthoughts. A technically brilliant agent that burns through your API budget or takes thirty seconds per interaction isn't viable at enterprise scale.
Izzo: What about the tooling ecosystem? Are we finally getting frameworks that can handle this complexity?
Boone: It's maturing fast. MLflow 3.0 now has experiment tracing and built-in LLM judge capabilities. TruLens does pluggable feedback functions with OpenTelemetry integration. LangChain Evals has task-specific evaluation chains.
Izzo: Wait, OpenTelemetry integration? That's smart. You can pipe agent behavior directly into your existing observability stack.
Boone: Exactly. And they show a minimal Claude plus LangChain example that demonstrates both reference-free helpfulness scoring and reference-aware correctness checking. The pattern extends naturally to multi-step traces.
Izzo: But there's a privacy landmine here, right? Real operational data has PII all over it.
Boone: Huge point. Before logging prompts, traces, or judge rationales, you need redaction pipelines. The article specifically calls out avoiding customer data exposure in evaluation logs.
Izzo: From a go-to-market angle, who's actually implementing this? It sounds like e-commerce teams are leading the charge.
Boone: The examples are all commerce workflows: order exception triage, pricing validation, payment issue investigation, L2 incident response. Makes sense, these are high-stakes, multi-step processes where silent failures hurt.
Izzo: And they're moving from controlled sandbox testing to production deployment. That transition is where everything breaks down without proper evaluation.
Boone: The failure modes they list are telling: fragile planning, unreliable API calls, memory drift across sessions, inconsistent multi-turn behavior. None of that shows up in traditional benchmarks.
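Show-notes sketch: the silent-refund scenario from earlier in the episode is the sort of failure a trace-level check might catch. The Python below is only an illustration under stated assumptions. The trace shape (`Step`, `Trace`), the tool names, and the `find_silent_failures` helper are all hypothetical and do not come from the article or from any of the frameworks mentioned.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    tool: str    # hypothetical tool name, e.g. "refund_api"
    status: str  # "ok" or "error"


@dataclass
class Trace:
    case_id: str
    steps: list = field(default_factory=list)
    outcome: str = "open"  # final state the agent reported


def find_silent_failures(traces, required_tools):
    """Return (case_id, missing_tools) for cases marked resolved
    although a required tool call never succeeded."""
    flagged = []
    for trace in traces:
        if trace.outcome != "resolved":
            continue
        succeeded = {s.tool for s in trace.steps if s.status == "ok"}
        missing = set(required_tools) - succeeded
        if missing:
            flagged.append((trace.case_id, sorted(missing)))
    return flagged


# Two made-up traces: the second mirrors the skipped-refund story.
traces = [
    Trace("case-1", [Step("detect_exception", "ok"), Step("refund_api", "ok")], "resolved"),
    Trace("case-2", [Step("detect_exception", "ok"), Step("refund_api", "error")], "resolved"),
]
flagged = find_silent_failures(traces, ["refund_api"])
print(flagged)  # [('case-2', ['refund_api'])]
```

A check like this looks at what the agent did across steps rather than at the text of its final answer, which is why a single-turn text metric would not surface the same problem.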
Izzo: This feels like the moment where AI agents either become genuinely useful or stay in demo hell forever.
Boone: Right. And the article emphasizes that evaluation isn't a one-time gate. It's a continuous loop feeding back into agent design at every stage. I'm definitely adding a proper agent evaluation pipeline to my weekend project list.
Izzo: Okay, what should people actually go build? Give me three concrete next steps.
Boone: First, clone that reference repository they mention and run the Claude plus LangChain evaluation example. Second, if you're using MLflow, upgrade to 3.0 and experiment with the built-in LLM judge capabilities.
Izzo: And third?
Boone: Set up trace-based analysis for any agent you're running. Start logging tool calls, API responses, and state transitions. You can't evaluate what you can't observe.
Izzo: Perfect. The era of hoping your AI agent works in production is officially over. Time to measure what actually matters. That's a wrap on episode two-thirty-one of Exploring Next. Next time we're exploring something that'll probably make Boone add three more items to his project backlog.
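Show-notes sketch: for the third step (logging tool calls, API responses, and outcomes, with redaction before anything is written), a rough starting point in plain Python could look like the following. It uses only the standard library. The `TraceLog` class, the `redact` helper, and the two example tools are hypothetical names made up for this sketch, and the email pattern is deliberately crude; real PII redaction needs much more than one regex.

```python
import json
import re
import time

# Crude illustrative pattern; not a complete PII detector.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text):
    """Mask email addresses before the text reaches the log."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


class TraceLog:
    """Collects one record per tool call: name, redacted I/O, status, latency."""

    def __init__(self):
        self.records = []

    def call(self, tool_name, fn, payload):
        start = time.perf_counter()
        status, output = "ok", None
        try:
            output = fn(payload)
            return output
        except Exception as exc:
            status, output = "error", repr(exc)
            raise  # the caller still sees the failure
        finally:
            # Runs on success and on error, so failed calls are never dropped.
            self.records.append({
                "tool": tool_name,
                "input": redact(str(payload)),
                "output": redact(str(output)),
                "status": status,
                "latency_s": round(time.perf_counter() - start, 4),
            })


def lookup_order(payload):  # hypothetical tool
    return f"order {payload['order_id']} belongs to jane@example.com"


def issue_refund(payload):  # hypothetical tool that always fails
    raise RuntimeError("refund service unavailable")


log = TraceLog()
log.call("lookup_order", lookup_order, {"order_id": "A17"})
try:
    log.call("issue_refund", issue_refund, {"order_id": "A17"})
except RuntimeError:
    pass  # the error is already recorded in the trace

print(json.dumps(log.records, indent=2))
```

Recording the failed call in `finally` is the design choice that matters here: the error stays visible in the trace even if the calling code swallows it, and the per-call latency gives a first handle on the operational targets discussed in the episode.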