
Evaluating AI agents in production: how to measure quality without fooling yourself

Without continuous evaluation, AI agents degrade and no one notices — until the customer complains. The evaluation framework we use in all Gemini Enterprise projects: gold set, metrics, monitoring and drift.

Fabiano Brito

CEO & Founder

TL;DR: AI agents are not deterministic software — they drift, regress when models change, and break silently on edge cases. Without formal evaluation, you're operating in the dark. The practical framework: gold set, 4 metrics, continuous monitoring, and human spot review.

In traditional software, "passed tests" is binary. In AI, it's a distribution: 92% of cases pass, 8% degrade, and today's degraded case can become tomorrow's catastrophic one. That's why evaluation is not a phase — it's a continuous loop.


4 evaluation dimensions

Dimension 1

Faithfulness

Is the response grounded in the retrieved data? This is the anti-hallucination metric.

How to measure: LLM-as-judge (Gemini Pro evaluating response vs context) + weekly human sampling.

Dimension 2

Relevance

Does the response address the question? Penalizes correct but off-topic answers.

How to measure: embedding similarity + LLM-as-judge.
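The embedding-similarity half of this check can be sketched as plain cosine similarity over precomputed embedding vectors (how you obtain the embeddings is up to your stack; the 0.7 cutoff is an assumption to calibrate on your own gold set):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def is_on_topic(question_emb: list[float], answer_emb: list[float],
                threshold: float = 0.7) -> bool:
    """Flag answers whose embedding sits too far from the question's.

    The 0.7 threshold is illustrative; calibrate it against human labels.
    """
    return cosine_similarity(question_emb, answer_emb) >= threshold
```

Low similarity alone doesn't prove irrelevance — that's why the article pairs it with LLM-as-judge — but it is a cheap first filter.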

Dimension 3

Completeness

Does it cover all relevant aspects? Critical for multi-part questions.

How to measure: human rubric or LLM-as-judge with explicit criteria.

Dimension 4

Safety

Does it avoid prohibited content (PII leakage, bias, inappropriate recommendations)?

How to measure: specific classifier + rule-based + human sampling.

Ignoring one dimension = surprise in production. Covering all 4 = a solid foundation.


Gold set: the most underrated asset

A gold set is the collection of (question, expected answer) pairs curated by humans. It's what separates real evaluation from "guesswork".

How to build one

  • Minimum size: 50–100 cases for pilot, 300–500 for production, 1,000+ for critical systems.
  • Diversity: cover all main intents, known edge cases, ambiguities.
  • Expected answer goes beyond content: desired format, citations, tone.
  • Multi-reviewer annotation: two humans per case, with a third arbitrating disagreements.
  • Versioning: each version has a hash + date + owner.
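A minimal sketch of a versioned gold-set record along these lines — the field names and the SHA-256 hashing scheme are illustrative choices, not a prescribed format:

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import date

@dataclass(frozen=True)
class GoldCase:
    question: str
    expected_answer: str        # content plus desired format, citations, tone
    intent: str
    reviewers: tuple[str, str]  # two annotators; disagreements go to a third

def gold_set_version(cases: list[GoldCase], owner: str) -> dict:
    """Version record with hash + date + owner, as described above.

    The hash is deterministic over the case contents, so any edit to any
    case produces a new version identifier.
    """
    payload = json.dumps([asdict(c) for c in cases], sort_keys=True)
    return {
        "hash": hashlib.sha256(payload.encode()).hexdigest()[:12],
        "date": date.today().isoformat(),
        "owner": owner,
        "n_cases": len(cases),
    }
```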

How to maintain it

  • Every real regression in production becomes a new case in the gold set.
  • Every business rule change triggers a review of affected cases.
  • Quarterly review: remove obsolete cases, add new scenarios.

Without a gold set, you're iterating prompts while looking at 3 examples in Slack. It's the difference between engineering and guessing.

When to evaluate

  • Prompt change: full automated gold set.
  • Model upgrade (2.5 → 3): gold set + human review of 100 cases.
  • RAG change (chunking, reranker): retrieval-focused gold set.
  • Daily in production: sample of 50–100 conversations.
  • Weekly: drift analysis, top categories with decline.
  • Monthly: deep human review of 200 cases.
  • Quarterly: bias audit, fairness by segment.

LLM-as-judge: using it well

An LLM evaluating another LLM has known biases (it favors longer responses, outputs from the same model family, and an assertive tone). To use it well:

  • Different model from the one being tested whenever possible.
  • Explicit rubric: numbered criteria, not "is it good?".
  • Calibration against human: for every 200 cases, 20 with parallel human review. Agreement < 80% → revise the rubric.
  • Multiple rounds: 3 evaluations with different seeds, aggregation by median.
  • Cite the passage: the judge needs to explain why it scored that way — makes auditing easier.
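The rubric-plus-median pattern above can be sketched like this. The judge call itself is a placeholder — wire in your own Gemini request with a per-round seed — and the rubric text is an example, not a prescribed one:

```python
from statistics import median
from typing import Callable

RUBRIC = (
    "Score the answer 1-5 against these numbered criteria:\n"
    "1. Every claim is supported by the provided context.\n"
    "2. Nothing is added beyond the context.\n"
    "3. The supporting passage is cited.\n"
    "Return only the integer score and the passage that justifies it."
)

def judge_score(judge_fn: Callable[[str, int], int], question: str,
                answer: str, context: str, rounds: int = 3) -> int:
    """Run the judge `rounds` times with different seeds, aggregate by median.

    `judge_fn` stands in for the real LLM call (ideally a different model
    from the one being tested); it takes (prompt, seed) and returns an int.
    """
    prompt = f"{RUBRIC}\n\nQuestion: {question}\nContext: {context}\nAnswer: {answer}"
    scores = [judge_fn(prompt, seed) for seed in range(rounds)]
    return int(median(scores))
```

The median keeps one outlier round from moving the score, which is the point of running multiple rounds.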

Drift: the silent killer

⚠️ Invisible drift: an agent that worked in January degrades by April because the question distribution changed, the RAG was updated, or Google rolled out a new Gemini snapshot. Without drift monitoring, you find out from customer support complaints.

How to detect it:

  • Monitor embedding distribution of questions: if the cluster shifts, alert.
  • Distribution of tool calls: if a previously used tool disappears, investigate.
  • Latency by intent: sudden increase = behavior change.
  • Fallback rate (agent says "I don't know"): if it rises, RAG is losing coverage.
  • Human channel complaints: lagging indicator but reliable.
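Two of these signals — a shift in the question-embedding cluster and a rising fallback rate — can be checked in a few lines. The thresholds here are assumptions to calibrate against your own baseline window:

```python
import math

def _centroid(vectors: list[list[float]]) -> list[float]:
    """Mean vector of a batch of embeddings."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def drift_alerts(baseline_emb: list[list[float]],
                 current_emb: list[list[float]],
                 fallback_rate: float,
                 shift_threshold: float = 0.15,
                 fallback_threshold: float = 0.10) -> list[str]:
    """Compare this week's question embeddings and fallback rate to a baseline.

    Threshold values are illustrative; set them from historical variance.
    """
    a, b = _centroid(baseline_emb), _centroid(current_emb)
    shift = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    alerts = []
    if shift > shift_threshold:
        alerts.append("question embedding distribution shifted")
    if fallback_rate > fallback_threshold:
        alerts.append("fallback rate above threshold: RAG may be losing coverage")
    return alerts
```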

Product metrics (not just model metrics)

Evaluating an agent without product metrics means optimizing for the wrong thing:

  • Autonomous resolution rate: % of conversations resolved without human handoff.
  • Post-conversation CSAT: single question at the end.
  • Resolution time: compared to the human baseline.
  • Return rate: user came back within 24h with the same question = initial answer wasn't enough.
  • Cost per conversation: tokens × price + tools.
  • Conversion (in commercial use): requested a quote, scheduled, purchased.
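A sketch of how these roll up from raw conversation records — the field names (`escalated`, `csat`, `returned_24h`, `cost_usd`) are hypothetical; map them to whatever your own conversation schema stores:

```python
def product_metrics(conversations: list[dict]) -> dict:
    """Aggregate the product metrics above from per-conversation records."""
    n = len(conversations)
    rated = [c["csat"] for c in conversations if c["csat"] is not None]
    return {
        "autonomous_resolution_rate":
            sum(1 for c in conversations if not c["escalated"]) / n,
        "avg_csat": sum(rated) / len(rated) if rated else None,
        "return_rate_24h":
            sum(1 for c in conversations if c["returned_24h"]) / n,
        "cost_per_conversation": sum(c["cost_usd"] for c in conversations) / n,
    }
```

In practice this aggregation lives as a scheduled query over the conversation store rather than in application code, but the definitions are the same.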

Stack in Autenticare projects

  • Vertex AI Evaluation: native, plugged directly into the Gemini Enterprise agent.
  • BigQuery: stores conversations, scores, metadata. Ad-hoc SQL.
  • Looker: quality and drift dashboards.
  • Cloud Run jobs: run the gold set daily, alert on regressions.
  • PagerDuty: human alert when a key metric drops below threshold.
  • Weekly notebook: consultant does a deep dive on 50 real conversations and produces a report.

Minimum checklist before going to production

1
Gold set > 100 cases, 2+ reviewers

Diversity of intents, edge cases, ambiguities.

2
Automated pipeline at every deploy

Gold set run before any promotion to production.

3
4 metrics with thresholds

Faithfulness, relevance, completeness, safety. Below minimum → blocks deploy.

4
Live dashboard + alert

Regression > 5% on any metric triggers PagerDuty.

5
Weekly human review + owner with pause authority

50–100 sampled conversations and a named person with authority to stop the agent.

If any item is missing, the agent is not ready.
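The threshold gate from item 3 can be as small as this — the threshold values are illustrative, not recommendations:

```python
THRESHOLDS = {  # illustrative minimums; set your own per project
    "faithfulness": 0.90,
    "relevance": 0.85,
    "completeness": 0.80,
    "safety": 0.99,
}

def deploy_gate(scores: dict[str, float],
                thresholds: dict[str, float] = THRESHOLDS) -> tuple[bool, dict]:
    """Block promotion when any of the 4 metrics is below its minimum.

    Returns (ok, failures) so the pipeline can log exactly which metric
    blocked the deploy.
    """
    failures = {m: s for m, s in scores.items() if s < thresholds[m]}
    return not failures, failures
```

Wiring this as the last step of the deploy pipeline is what turns "we have metrics" into "metrics block bad releases".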

Quality audit

Is your agent already in production without formal evaluation?

Autenticare runs a 2-week audit: we build the initial gold set, configure the 4 metrics, and install a drift dashboard. We deliver the loop running, not just a report.

