Evaluating AI agents in production: how to measure quality without fooling yourself
Without continuous evaluation, AI agents degrade and no one notices — until the customer complains. The evaluation framework we use in all Gemini Enterprise projects: gold set, metrics, monitoring and drift.
Fabiano Brito
CEO & Founder
In traditional software, "passed tests" is binary. In AI, it's a distribution: 92% of cases pass, 8% degrade, and today's degraded case can become tomorrow's catastrophic one. That's why evaluation is not a phase — it's a continuous loop.
4 evaluation dimensions
Faithfulness
Is the response grounded in the retrieved data? This is the anti-hallucination metric.
How to measure: LLM-as-judge (Gemini Pro evaluating response vs context) + weekly human sampling.
Relevance
Does the response address the question? Penalizes correct but off-topic answers.
How to measure: embedding similarity + LLM-as-judge.
Completeness
Does it cover all relevant aspects? Critical for multi-part questions.
How to measure: human rubric or LLM-as-judge with explicit criteria.
Safety
Does it avoid prohibited content (PII leakage, bias, inappropriate recommendations)?
How to measure: specific classifier + rule-based + human sampling.
Ignoring one dimension = surprise in production. Covering all 4 = a solid foundation.
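The LLM-as-judge measurement used for faithfulness above can be sketched as a rubric prompt plus strict parsing of the verdict. This is an illustrative sketch, not the actual Vertex AI API: the rubric wording and the JSON verdict format are assumptions, and the actual judge call is left out.

```python
import json

# Hypothetical rubric: the 1-5 anchors and JSON output format are
# assumptions for illustration, not a prescribed standard.
FAITHFULNESS_RUBRIC = """\
Score the RESPONSE only on grounding in the CONTEXT.
1 = contradicts or invents facts absent from the context
3 = mostly grounded, with minor unsupported details
5 = every claim is supported by a passage in the context
Return JSON: {"score": <1-5>, "evidence": "<quoted passage>"}"""

def build_judge_prompt(question: str, context: str, response: str) -> str:
    """Assemble an explicit-rubric prompt for the LLM judge."""
    return (f"{FAITHFULNESS_RUBRIC}\n\n"
            f"QUESTION: {question}\nCONTEXT: {context}\nRESPONSE: {response}")

def parse_judge_verdict(raw: str) -> dict:
    """Parse the judge's JSON verdict; reject scores outside the rubric."""
    verdict = json.loads(raw)
    if not 1 <= verdict["score"] <= 5:
        raise ValueError("score outside rubric range")
    return verdict
```

Forcing the judge to quote the evidence passage is what makes the weekly human sampling cheap: the reviewer checks the quote, not the whole context.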
Gold set: the most underrated asset
A gold set is the collection of (question, expected answer) pairs curated by humans. It's what separates real evaluation from "guesswork".
How to build one
- Minimum size: 50–100 cases for pilot, 300–500 for production, 1,000+ for critical systems.
- Diversity: cover all main intents, known edge cases, ambiguities.
- Expected answer goes beyond content: desired format, citations, tone.
- Multi-reviewer annotation: two humans per case, with a third arbitrating disagreements.
- Versioning: each version has a hash + date + owner.
How to maintain it
- Every real regression in production becomes a new case in the gold set.
- Every business rule change triggers a review of affected cases.
- Quarterly review: remove obsolete cases, add new scenarios.
Without a gold set, you're iterating prompts while looking at 3 examples in Slack. It's the difference between engineering and guessing.
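The versioning rule above (hash + date + owner per version) can be made concrete with a content hash over the serialized cases. A minimal sketch; the `GoldCase` field names are an assumed schema, not a prescribed format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field

@dataclass
class GoldCase:
    # Assumed schema: question, expected answer, plus the format/citation
    # expectations and reviewer trail the article recommends.
    question: str
    expected_answer: str
    intent: str
    expected_citations: list = field(default_factory=list)
    reviewers: list = field(default_factory=list)

def gold_set_version(cases: list) -> str:
    """Deterministic 12-char hash over the serialized gold set:
    editing any field of any case yields a new version identifier."""
    payload = json.dumps([asdict(c) for c in cases], sort_keys=True,
                         ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```

Because the hash is deterministic, two runs against the same gold set are directly comparable, and a silent edit to a case can never masquerade as a model improvement.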
When to evaluate
| Moment | What to run |
|---|---|
| Prompt change | Full automated gold set |
| Model upgrade (2.5 → 3) | Gold set + human review of 100 cases |
| RAG change (chunking, reranker) | Retrieval-focused gold set |
| Daily in production | Sample of 50–100 conversations |
| Weekly | Drift analysis, top categories with decline |
| Monthly | Deep human review of 200 cases |
| Quarterly | Bias audit, fairness by segment |
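The "full automated gold set on every prompt change" row is the one that belongs in CI. A hedged sketch of that gate, where `agent` and `score_fn` are stand-ins for the system under test and the judge metric (a 1–5 scale is assumed):

```python
def run_gold_set(agent, cases, score_fn, pass_score=4, min_pass_rate=0.90):
    """Score the agent's answer for every gold case and decide whether
    the change may be promoted. `agent` maps question -> answer;
    `score_fn` maps (expected, actual) -> 1-5 (both hypothetical)."""
    scores = [score_fn(case.expected_answer, agent(case.question))
              for case in cases]
    pass_rate = sum(s >= pass_score for s in scores) / len(scores)
    return {"pass_rate": pass_rate, "promote": pass_rate >= min_pass_rate}
```

The 90% threshold is illustrative; the point is that the number exists before the change does, so "looks fine to me" never decides a promotion.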
LLM-as-judge: using it well
LLM evaluating LLM has known biases (favors longer responses, same-model outputs, assertive tone). To use it well:
- Different model from the one being tested whenever possible.
- Explicit rubric: numbered criteria, not "is it good?".
- Calibration against human: for every 200 cases, 20 with parallel human review. Agreement < 80% → revise the rubric.
- Multiple rounds: 3 evaluations with different seeds, aggregation by median.
- Cite the passage: the judge needs to explain why it scored that way — makes auditing easier.
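The multiple-rounds rule above is a one-liner in practice. A sketch, where `judge_fn` is a stand-in for whatever call returns a single numeric score from the judge model:

```python
from statistics import median

def judge_with_rounds(judge_fn, prompt, seeds=(11, 22, 33)):
    """Run the judge once per seed and aggregate by median, which
    dampens a single outlier round. Seed values are arbitrary."""
    return median(judge_fn(prompt, seed=s) for s in seeds)
```

Median rather than mean is deliberate: one inflated or deflated round cannot move the aggregate.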
Drift: the silent killer
Drift is the gradual shift in user questions, underlying data, or model behavior that erodes quality without any single visible failure. How to detect it:
- Monitor embedding distribution of questions: if the cluster shifts, alert.
- Distribution of tool calls: if a previously used tool disappears, investigate.
- Latency by intent: sudden increase = behavior change.
- Fallback rate (agent says "I don't know"): if it rises, RAG is losing coverage.
- Human channel complaints: lagging indicator but reliable.
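The first detector in the list, monitoring the embedding distribution of questions, can be sketched as a centroid comparison. A minimal version with plain lists standing in for real embedding vectors; the 0.95 similarity threshold is an assumption to tune per project:

```python
import math

def _centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def embedding_drift(baseline, recent, min_similarity=0.95):
    """Compare the centroid of recent question embeddings with the
    baseline centroid; low cosine similarity suggests the question
    cluster has shifted and the gold set no longer matches traffic."""
    sim = _cosine(_centroid(baseline), _centroid(recent))
    return {"centroid_similarity": sim, "alert": sim < min_similarity}
```

A centroid shift does not say *what* changed, only that the gold set and production traffic are diverging; the weekly human deep dive answers the what.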
Product metrics (not just model metrics)
Evaluating an agent without product metrics means optimizing for the wrong thing:
- Autonomous resolution rate: % of conversations ended without a human.
- Post-conversation CSAT: single question at the end.
- Resolution time: compared to the human baseline.
- Return rate: user came back within 24h with the same question = initial answer wasn't enough.
- Cost per conversation: tokens × price + tools.
- Conversion (in commercial use): requested a quote, scheduled, purchased.
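Two of these metrics, autonomous resolution and the 24h return rate, fall straight out of the conversation log. A sketch assuming each record carries `user_id`, `question`, `timestamp` (a datetime), and an `escalated` flag; the field names are illustrative:

```python
from datetime import timedelta

def product_metrics(conversations):
    """Compute autonomous resolution rate and 24h return rate from
    raw conversation records (assumed dict schema, see lead-in)."""
    n = len(conversations)
    autonomous = sum(not c["escalated"] for c in conversations) / n
    # A "return": same user asks the same question within 24h of an
    # earlier conversation -> the first answer wasn't enough.
    returns, last_seen = 0, {}
    for c in sorted(conversations, key=lambda c: c["timestamp"]):
        key = (c["user_id"], c["question"])
        prev = last_seen.get(key)
        if prev is not None and c["timestamp"] - prev <= timedelta(hours=24):
            returns += 1
        last_seen[key] = c["timestamp"]
    return {"autonomous_resolution_rate": autonomous,
            "return_rate_24h": returns / n}
```

In the stack described below this would run as SQL over BigQuery; the Python version just makes the definitions unambiguous.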
Stack in Autenticare projects
- Vertex AI Evaluation: native, plugged directly into the Gemini Enterprise agent.
- BigQuery: stores conversations, scores, metadata. Ad-hoc SQL.
- Looker: quality and drift dashboards.
- Cloud Run jobs: run the gold set daily, alert on regressions.
- PagerDuty: human alert when a key metric drops below threshold.
- Weekly notebook: consultant does a deep dive on 50 real conversations and produces a report.
Minimum checklist before going to production
- Gold set covering the diversity of intents, edge cases, and ambiguities.
- Full gold set run before any promotion to production.
- Minimum thresholds for faithfulness, relevance, completeness, and safety; any score below minimum blocks deploy.
- Regression > 5% on any metric triggers PagerDuty.
- Daily review of 50–100 sampled conversations, and a named person with authority to stop the agent.
If any item is missing, the agent is not ready.
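The regression trigger in the checklist is mechanical. A sketch of the comparison step; metric names and the 5% threshold follow the checklist, while the paging call itself is left out:

```python
def regressions(baseline: dict, current: dict, max_drop: float = 0.05):
    """Return every metric that dropped more than max_drop versus the
    baseline run; a non-empty result is what should page the on-call."""
    return {m: round(baseline[m] - current[m], 4)
            for m in baseline
            if m in current and baseline[m] - current[m] > max_drop}
```

Returning the offending metrics with their deltas, rather than a bare boolean, means the page already tells the on-call where to look.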
Is your agent already in production without formal evaluation?
Autenticare runs a 2-week audit: builds the initial gold set, configures 4 metrics, installs a drift dashboard. We deliver the loop running, not just a report.
