The AI agent market is projected to grow from $5.1 billion in 2024 to over $47 billion by 2030, yet Gartner predicts that more than 40% of agentic AI projects will be canceled by the end of 2027. The reason is not model capability. It is trust.
Traditional AI evaluation tells you whether a model performs well in isolation. Accuracy benchmarks, latency metrics and token efficiency measure what models can do. They do not measure whether users will trust an agent to act on their behalf. As InfoWorld has noted, reliability and predictability remain top enterprise challenges for agentic AI. These are interaction-layer problems, not model-layer problems, and they require a different approach to evaluation.
In my experience leading user research for AI-powered collaboration experiences at Microsoft and Cisco, I have observed a consistent pattern: the teams that succeed with agentic AI evaluate agent behavior from the user’s perspective, not just model performance. What follows is a framework for doing exactly that.
The evaluation gap
A 2024 meta-analysis published in Nature Human Behaviour analyzed 106 studies and found something counterintuitive: human-AI combinations often performed worse than either humans or AI alone. Performance degradation occurred in decision-making tasks, while content creation showed gains. The difference was not model quality. It was how humans and AI systems interacted.
This has direct implications for agent builders. Standard benchmarks miss the interaction layer entirely. An agent can score perfectly on retrieval benchmarks and still fail users because it cannot signal uncertainty or interpret requests in ways that diverge from user intent.
Research from GitHub and Accenture reinforces this complexity. While developers using AI assistants completed tasks 55% faster, a GitClear analysis found AI-generated code has 41% higher churn, indicating more frequent revisions. The productivity gains are real, but so is the gap between technically valid outputs and pragmatically correct ones.
Rethinking what AI evaluation should measure
The gap between benchmark performance and user trust points to a fundamental question: what should we actually be evaluating? Traditional metrics tell us whether an agent produced a correct output. They do not tell us whether users understood what the agent was doing, trusted the result or could recover when something went wrong.
This is where user experience methodology becomes essential. UX research has always focused on the gap between what systems do and what users experience. The same methods that reveal usability failures in traditional software reveal trust failures in AI agents. Interaction-layer evaluation applies this lens to agentic AI, shifting focus from “did the model perform well?” to “did the user experience work?”
This reframing reveals three dimensions that determine whether agents succeed in practice.
Does the agent understand what users actually want?
The most common interaction failure is invisible to traditional evaluation. An agent interprets a request differently than the user intended, produces a reasonable response to that interpretation, and passes every accuracy metric. The user, meanwhile, receives something they did not ask for.
This is the intent alignment problem. Standard evaluation cannot detect it because the agent’s interpretation was technically valid. The failure exists only in the gap between what the user meant and what the agent understood.
Effective evaluation measures this gap directly: How often do users correct agent interpretations? How frequently do they abandon tasks after the first response? How many times do they reformulate requests to clarify their original intent? These metrics reveal misalignment that accuracy scores hide.
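As a rough sketch, these misalignment signals can be computed directly from interaction logs. The event schema and field names below are purely illustrative, not drawn from any specific platform:

```python
from collections import Counter

# Hypothetical interaction-log records; event names are illustrative.
events = [
    {"session": "s1", "type": "agent_response"},
    {"session": "s1", "type": "user_correction"},
    {"session": "s2", "type": "agent_response"},
    {"session": "s2", "type": "task_abandoned"},
    {"session": "s3", "type": "agent_response"},
    {"session": "s3", "type": "request_reformulated"},
    {"session": "s3", "type": "agent_response"},
]

def intent_alignment_metrics(events):
    """Rates of corrections, abandonment and reformulation per agent response."""
    counts = Counter(e["type"] for e in events)
    responses = counts["agent_response"] or 1  # guard against division by zero
    return {
        "correction_rate": counts["user_correction"] / responses,
        "abandonment_rate": counts["task_abandoned"] / responses,
        "reformulation_rate": counts["request_reformulated"] / responses,
    }
```

Tracked over time, rising values in any of these rates flag intent misalignment long before accuracy metrics show a problem.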
The major platforms have recognized this challenge. OpenAI’s Operator agent implements explicit confirmation workflows, requiring user approval before consequential actions. Anthropic’s computer use documentation recommends human verification for sensitive tasks, assuming misalignment will occur and building recovery mechanisms accordingly. Microsoft’s HAX Toolkit codifies intent alignment as a design principle with 18 guidelines emphasizing accurate expectation-setting before agents act. Google’s Gemini provides API-level safety controls, leaving interaction-layer confirmation to implementation.
Does the agent know what it does not know?
Agents that express appropriate uncertainty earn trust. Agents that sound confident regardless of their actual reliability erode it. Yet standard evaluation treats all outputs the same: correct or incorrect, with no gradient in between.
This is the confidence calibration problem. Users need to know when to trust agent outputs and when to verify them. Without calibrated uncertainty signals, they either over-rely on unreliable outputs or waste time double-checking everything.
Effective evaluation tracks whether stated confidence levels predict actual reliability. If users override high-confidence outputs at the same rate as low-confidence ones, calibration is broken. If users rubber-stamp approvals regardless of uncertainty indicators, the signals are not reaching them.
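One minimal way to test this is to compare user override rates across the agent's stated-confidence buckets. The records below are hypothetical; the check itself is the point: if the high- and low-confidence rates converge, the confidence signal is not reaching users.

```python
# Hypothetical records pairing stated confidence with whether the user
# overrode (corrected or rejected) the output.
interactions = [
    {"confidence": "high", "overridden": False},
    {"confidence": "high", "overridden": False},
    {"confidence": "high", "overridden": True},
    {"confidence": "low", "overridden": True},
    {"confidence": "low", "overridden": True},
    {"confidence": "low", "overridden": False},
]

def override_rates(interactions):
    """Override rate per stated-confidence bucket."""
    buckets = {}
    for r in interactions:
        total, overrides = buckets.get(r["confidence"], (0, 0))
        buckets[r["confidence"]] = (total + 1, overrides + r["overridden"])
    return {k: o / t for k, (t, o) in buckets.items()}

rates = override_rates(interactions)
# A large low-minus-high gap suggests users do respond to confidence signals;
# a gap near zero means calibration (or its presentation) is broken.
calibration_gap = rates["low"] - rates["high"]
```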
Platform approaches to confidence vary significantly. Anthropic explicitly trains Claude to express epistemic uncertainty, with documentation noting that Claude refuses to answer approximately 70% of the time when genuinely uncertain. OpenAI’s models prioritize assertive responses, trading faster task completion against higher hallucination risk. Google provides technical logprobs for developers to assess token-level confidence, though surfacing this to users depends on implementation. Microsoft’s Copilot research found that users who verify AI recommendations make better decisions than those who accept them uncritically.
What do user corrections reveal about agent behavior?
Every time a user modifies agent output, they generate a signal about where the interaction layer is failing. Standard evaluation treats corrections as errors to minimize. Interaction-layer evaluation treats them as diagnostic data.
This is the correction pattern problem. The question is not just how often users correct agents, but what those corrections reveal. Did the agent misunderstand context? Apply wrong assumptions? Produce something technically correct but pragmatically useless?
Effective evaluation categorizes corrections by type and tracks trends over time. Rising rates in specific capability areas signal systematic problems. Consistent patterns across users reveal gaps that no benchmark would detect.
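A simple sketch of this categorize-and-trend approach, with hypothetical category labels and records:

```python
from collections import Counter, defaultdict

# Hypothetical labeled corrections; category names are illustrative.
corrections = [
    {"week": 1, "category": "misread_context"},
    {"week": 1, "category": "wrong_assumption"},
    {"week": 2, "category": "misread_context"},
    {"week": 2, "category": "misread_context"},
    {"week": 2, "category": "pragmatically_wrong"},
]

def correction_trends(corrections):
    """Per-week counts by correction category, to surface rising failure modes."""
    trends = defaultdict(Counter)
    for c in corrections:
        trends[c["week"]][c["category"]] += 1
    return dict(trends)

trends = correction_trends(corrections)
# "misread_context" rising week over week would flag a systematic
# context-understanding problem rather than scattered one-off errors.
```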
LinkedIn’s agentic AI platform, built on Microsoft’s infrastructure, captures this systematically: all generated emails must be editable and explicitly sent by the user, logging not just whether users edited but what they changed. Google’s PAIR Guidebook, used by over 250,000 practitioners, treats user corrections as training signal for understanding where models diverge from user mental models. Anthropic’s Constitutional AI uses structured feedback to identify systematic gaps between model behavior and user expectations, informing model updates rather than just flagging failures.
How UX research methods strengthen agent evaluation
Traditional AI evaluation relies on automated metrics. Interaction-layer evaluation requires understanding user behavior in context. This is where UX research methodology offers tools that engineering teams often lack.
- Task analysis identifies where agents need evaluation checkpoints. By mapping user workflows before building, teams discover high-stakes moments where intent misalignment causes cascading failures. An agent that misinterprets a request early in a complex workflow creates errors that compound with each subsequent step.
- Think-aloud protocols surface confidence calibration failures invisible to telemetry. When users verbalize their reasoning while interacting with agents, they reveal whether uncertainty signals are registering. A user who says “I guess this looks right” while approving a high-confidence output is exhibiting automation bias. No log file captures this; observation does.
- Correction taxonomies transform user modifications into actionable product signals. Rather than counting corrections as a single metric, categorize them: Did the agent misunderstand the request? Apply incorrect assumptions? Generate something technically valid but contextually wrong? Each category points to a different intervention.
- Diary studies capture trust evolution over time. Initial agent interactions look nothing like established usage patterns. A user might over-rely on an agent in week one, swing to excessive skepticism after a failure in week two, then settle into calibrated trust by week four. Cross-sectional usability tests miss this arc entirely. Longitudinal diary studies capture how trust calibrates, or miscalibrates, as users build mental models of what the agent can actually do.
- Contextual inquiry reveals environmental interference. Lab conditions sanitize the chaos where agents actually operate. Watching users in their real environment reveals how interruptions, multitasking and time pressure shape how they interpret agent outputs. A response that seems clear in a quiet testing room gets confusing when someone is also checking Slack.
Just as important is collecting feedback in the moment. Ask users how they felt about an interaction three days later and you get rationalized summaries, not ground truth. For example, in a research study evaluating a voice AI agent, I had users interact with it across four different tasks and collected feedback immediately after each one: how conversation quality, turn-taking and tone changes affected them and their trust in the AI.
This sequential structure catches what single-task evaluations miss. Did turn-taking feel natural? Did a flat response in task two make them speak more slowly in task three? By task four, you’re seeing accumulated trust or erosion from everything that came before.
These methods complement automated evaluation by revealing failure modes that metrics miss. Teams that integrate UX research into their evaluation cycle catch trust failures before they reach production.
Building AI evals into product development
Databricks’ approach to agent evaluation, using LLM judges alongside synthetic data generation, points toward scalable methods. But automated evaluation cannot replace understanding how users experience agent behavior in production.
Effective AI product development integrates interaction-layer evaluation throughout the cycle. This means defining evaluation criteria before building, not after. It means instrumenting for user behavior, not just model performance. Traditional observability captures latency and error rates; interaction-layer observability captures task abandonment, reformulation frequency and the nature of user corrections.
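A minimal sketch of what that instrumentation might look like, emitting structured user-behavior events alongside standard model telemetry. The event names and schema are illustrative assumptions, not a real platform's API:

```python
import json
import time

def log_event(session_id, event_type, detail=None, sink=print):
    """Emit a structured interaction-layer event (here, as a JSON line).

    event_type examples (illustrative): task_abandoned,
    request_reformulated, user_correction.
    """
    record = {
        "ts": time.time(),
        "session": session_id,
        "event": event_type,
        "detail": detail,  # e.g. what the user changed in a correction
    }
    sink(json.dumps(record))

# Example: record not just that a correction happened, but what changed.
log_event("s42", "user_correction",
          detail={"field": "date", "from": "Fri", "to": "Thu"})
```

Routing these events into the same pipeline as latency and error metrics lets teams query reformulation frequency or abandonment rates per release, the same way they already query p95 latency.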
For teams building on foundation models from OpenAI, Anthropic, Google or Microsoft, evaluation cannot stop at API-level metrics. The same model succeeds or fails depending on how the interaction layer surfaces capabilities and limitations to users.
The trust imperative
The research is clear: Human-AI collaboration improves outcomes when agents behave in ways users can understand and predict. It degrades outcomes when agent behavior is technically correct but pragmatically opaque.
Model capability is no longer the bottleneck. The bottleneck is the interaction layer. Trust is not built by better benchmarks. It is built by evaluating the dimensions that benchmarks miss.
The teams that build effective AI agents will evaluate what matters to users, not just what matters to model developers. That trust will determine which agentic AI projects succeed and which join the 40% that get canceled.
Note: All views expressed are my own.
This article is published as part of the Foundry Expert Contributor Network.