Evaluating AI agents in production: how to measure quality without fooling yourself
Without continuous evaluation, AI agents degrade and no one notices — until the customer complains. The evaluation framework we use in all Gemini Enterprise projects: gold set, metrics, monitoring and drift.
Fabiano Brito
CEO & Founder
Evaluating AI agents in production is a continuous loop of measuring performance across key dimensions like faithfulness, relevance, completeness, and safety. Because these non-deterministic systems can drift, regress, and break silently on edge cases, formal evaluation is critical to avoid operating in the dark.
In traditional software, "passed tests" is binary. In AI, it's a distribution: 92% of cases pass, 8% degrade, and today's degraded case can become tomorrow's catastrophic one. That's why evaluation is not a phase — it's a continuous loop.
4 evaluation dimensions
Faithfulness
Is the response grounded in the retrieved data? This is the anti-hallucination metric.
How to measure: LLM-as-judge (Gemini Pro evaluating response vs context) + weekly human sampling.
Relevance
Does the response address the question? Penalizes correct but off-topic answers.
How to measure: embedding similarity + LLM-as-judge.
Completeness
Does it cover all relevant aspects? Critical for multi-part questions.
How to measure: human rubric or LLM-as-judge with explicit criteria.
Safety
Does it avoid prohibited content (PII leakage, bias, inappropriate recommendations)?
How to measure: specific classifier + rule-based + human sampling.
Ignoring one dimension = surprise in production. Covering all 4 = a solid foundation.
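As an illustration of the rule-based layer of the safety check, here is a minimal sketch. The patterns and the `find_pii` helper are hypothetical and deliberately simple; they are meant to sit next to a dedicated classifier and human sampling, not to replace them.

```python
import re

# Hypothetical, intentionally simple patterns (assumptions for illustration).
# Real PII detection needs a proper classifier plus locale-specific rules.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "card_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}


def find_pii(response: str) -> list[str]:
    """Return the names of the PII patterns that match the agent response."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(response)]
```

A non-empty result can mark the response as a safety failure and route it to human review.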
Gold set: the most underrated asset
A gold set is the collection of (question, expected answer) pairs curated by humans. It's what separates real evaluation from "guesswork".
How to build one
- Minimum size: 50–100 cases for pilot, 300–500 for production, 1,000+ for critical systems.
- Diversity: cover all main intents, known edge cases, ambiguities.
- Expected answer goes beyond content: desired format, citations, tone.
- Multi-reviewer annotation: two humans per case, with a third arbitrating disagreements.
- Versioning: each version has a hash + date + owner.
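One way to represent a gold set case and its version record (hash + date + owner) is sketched below. The field names and the `gold_set_version` helper are assumptions for illustration, not a fixed schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class GoldCase:
    # Hypothetical fields: adapt them to your own annotation guidelines.
    question: str
    expected_answer: str
    intent: str
    expected_format: str = ""
    must_cite: bool = False


def gold_set_version(cases: list[GoldCase], owner: str, on: date) -> dict:
    """Build a version record: content hash + date + owner."""
    # Serialize each case deterministically and sort the results, so the
    # hash depends only on the content of the set, not on case order.
    serialized = sorted(
        json.dumps(asdict(case), sort_keys=True, ensure_ascii=False) for case in cases
    )
    digest = hashlib.sha256("\n".join(serialized).encode("utf-8")).hexdigest()
    return {"hash": digest, "date": on.isoformat(), "owner": owner, "size": len(cases)}
```

Storing this record with every evaluation run makes it possible to tell which gold set version produced which scores.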
How to maintain it
- Every real regression in production becomes a new case in the gold set.
- Every business rule change triggers a review of affected cases.
- Quarterly review: remove obsolete cases, add new scenarios.
Without a gold set, you're iterating prompts while looking at 3 examples in Slack. It's the difference between engineering and guessing.
When to evaluate
| Moment | What to run |
|---|---|
| Prompt change | Full automated gold set |
| Model upgrade (2.5 → 3) | Gold set + human review of 100 cases |
| RAG change (chunking, reranker) | Retrieval-focused gold set |
| Daily in production | Sample of 50–100 conversations |
| Weekly | Drift analysis, top categories with decline |
| Monthly | Deep human review of 200 cases |
| Quarterly | Bias audit, fairness by segment |
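The daily row of the table can be sketched as a reproducible random sample. The `sample_conversations` helper and the idea of passing a seed (for example, derived from the date) are assumptions, not part of any specific tool.

```python
import random
from typing import Optional


def sample_conversations(
    conversation_ids: list[str], k: int = 50, seed: Optional[int] = None
) -> list[str]:
    """Draw up to k distinct conversation IDs for the daily review."""
    # A fixed seed makes the sample reproducible, which helps when a
    # reviewer needs to revisit exactly the same conversations later.
    rng = random.Random(seed)
    return rng.sample(conversation_ids, min(k, len(conversation_ids)))
```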
LLM-as-judge: using it well
An LLM judging another LLM has known biases: it tends to favor longer responses, outputs from its own model family, and an assertive tone. To use it well:
- Different model from the one being tested whenever possible.
- Explicit rubric: numbered criteria, not "is it good?".
- Calibration against humans: for every 200 cases, send 20 to parallel human review. If judge-human agreement falls below 80%, revise the rubric.
- Multiple rounds: 3 evaluations with different seeds, aggregation by median.
- Cite the passage: the judge needs to explain why it scored that way — makes auditing easier.
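The aggregation and calibration rules above can be sketched as follows. The function names are hypothetical, and the judge call itself is left out: the sketch assumes you already have the scores from each round and pass/fail labels from both the judge and the human reviewers.

```python
from statistics import median


def aggregate_rounds(scores: list[float]) -> float:
    """Combine the scores of several judge rounds by taking the median."""
    return median(scores)


def agreement_rate(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of calibration cases where judge and human gave the same verdict."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def rubric_needs_revision(
    judge_labels: list[bool], human_labels: list[bool], min_agreement: float = 0.80
) -> bool:
    """True when judge-human agreement falls below the calibration threshold."""
    return agreement_rate(judge_labels, human_labels) < min_agreement
```

The median is less sensitive than the mean to a single outlier round, which is why it suits a small number of evaluations such as three.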
Drift: the silent killer
How to detect it:
- Monitor embedding distribution of questions: if the cluster shifts, alert.
- Distribution of tool calls: if a previously used tool disappears, investigate.
- Latency by intent: sudden increase = behavior change.
- Fallback rate (agent says "I don't know"): if it rises, RAG is losing coverage.
- Human channel complaints: lagging indicator but reliable.
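Two of the signals above, tool-call distribution and fallback rate, can be sketched with the standard library. Total variation distance is one possible way to compare distributions, chosen here as an assumption; the alert thresholds are left to you.

```python
from collections import Counter


def tool_distribution(calls: list[str]) -> dict[str, float]:
    """Relative frequency of each tool in a window of tool calls."""
    counts = Counter(calls)
    total = sum(counts.values())
    return {tool: n / total for tool, n in counts.items()} if total else {}


def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance between two distributions (0 = same, 1 = disjoint)."""
    tools = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in tools)


def disappeared_tools(baseline_calls: list[str], current_calls: list[str]) -> set[str]:
    """Tools used in the baseline window that no longer appear at all."""
    return set(baseline_calls) - set(current_calls)


def fallback_rate(is_fallback: list[bool]) -> float:
    """Share of conversations where the agent answered "I don't know"."""
    return sum(is_fallback) / len(is_fallback) if is_fallback else 0.0
```

Comparing this week's window against a fixed baseline window, and alerting when the distance or the fallback rate crosses a limit you choose, gives a simple first drift monitor.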
Product metrics (not just model metrics)
Evaluating an agent without product metrics means optimizing for the wrong thing:
- Autonomous resolution rate: % of conversations ended without a human.
- Post-conversation CSAT: single question at the end.
- Resolution time: compared to the human baseline.
- Return rate: user came back within 24h with the same question = initial answer wasn't enough.
- Cost per conversation: tokens × price + tools.
- Conversion (in commercial use): requested a quote, scheduled, purchased.
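A minimal sketch of three of these metrics follows. The `Conversation` fields are hypothetical, "same question" is approximated by "same user and same intent", and the token prices are parameters you supply; none of this reflects a specific billing API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Conversation:
    # Hypothetical log record (an assumption for illustration).
    user_id: str
    intent: str
    started_at: datetime
    escalated_to_human: bool
    input_tokens: int
    output_tokens: int
    tool_cost: float = 0.0


def autonomous_resolution_rate(convs: list[Conversation]) -> float:
    """Share of conversations that ended without a human taking over."""
    return sum(not c.escalated_to_human for c in convs) / len(convs)


def return_rate(convs: list[Conversation], window: timedelta = timedelta(hours=24)) -> float:
    """Share of conversations followed by the same user and intent within the window."""
    ordered = sorted(convs, key=lambda c: c.started_at)
    returned = 0
    for i, first in enumerate(ordered):
        if any(
            later.user_id == first.user_id
            and later.intent == first.intent
            and later.started_at - first.started_at <= window
            for later in ordered[i + 1:]
        ):
            returned += 1
    return returned / len(ordered)


def cost_per_conversation(c: Conversation, price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Tokens times price, plus tool costs."""
    return (
        c.input_tokens / 1000 * price_in_per_1k
        + c.output_tokens / 1000 * price_out_per_1k
        + c.tool_cost
    )
```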
Stack in Autenticare projects
- Vertex AI Evaluation: native, plugged directly into the Gemini Enterprise agent.
- BigQuery: stores conversations, scores, metadata. Ad-hoc SQL.
- Looker: quality and drift dashboards.
- Cloud Run jobs: run the gold set daily, alert on regressions.
- PagerDuty: human alert when a key metric drops below threshold.
- Weekly notebook: consultant does a deep dive on 50 real conversations and produces a report.
Minimum checklist before going to production
- Gold set covers a diversity of intents, edge cases, and ambiguities.
- Gold set run before any promotion to production.
- Minimum thresholds defined for faithfulness, relevance, completeness, and safety. Below minimum → blocks deploy.
- Regression > 5% on any metric triggers PagerDuty.
- Daily review of 50–100 sampled conversations, and a named person with authority to stop the agent.
If any item is missing, the agent is not ready.
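The threshold and regression items of the checklist can be sketched as a single gate function. This is an illustration under stated assumptions: scores are on a 0 to 1 scale, "regression > 5%" is read as an absolute drop of more than 0.05 against the last accepted baseline, and the minimum values are yours to define.

```python
DIMENSIONS = ("faithfulness", "relevance", "completeness", "safety")


def gate(
    current: dict[str, float],
    baseline: dict[str, float],
    minimums: dict[str, float],
    max_regression: float = 0.05,
) -> list[str]:
    """Return the reasons to block a deploy; an empty list means the gate passes."""
    reasons = []
    for dim in DIMENSIONS:
        # Hard floor: any dimension below its minimum blocks the deploy.
        if current[dim] < minimums[dim]:
            reasons.append(f"{dim}: {current[dim]:.2f} is below minimum {minimums[dim]:.2f}")
        # Regression check against the last accepted baseline (absolute drop).
        if baseline[dim] - current[dim] > max_regression:
            reasons.append(f"{dim}: dropped more than {max_regression:.2f} from baseline")
    return reasons
```

A non-empty result would both fail the promotion step and be a natural payload for the alert.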
Frequently Asked Questions about Evaluating AI agents in production: how to measure quality without fooling yourself
Why is continuous evaluation important for AI agents in production?
AI agents are non-deterministic and can exhibit drift, regression, and silent failures. Continuous evaluation ensures that quality is monitored and maintained over time.
What are the four main dimensions for evaluating AI agents?
The four dimensions are: Faithfulness (whether the answer is grounded in the data), Relevance (whether the answer addresses the question), Completeness (whether it covers all relevant aspects), and Safety (whether it avoids prohibited content).
What is a ‘gold set’ and why is it important?
A ‘gold set’ is a set of (question, expected answer) pairs curated by humans. It is essential for accurate and objective evaluation of AI agents.
How often should I run evaluations in production?
It is recommended to run evaluations daily with conversation samples, weekly for drift analysis, monthly with human review, and quarterly for bias auditing.
Is your agent already in production without formal evaluation?
Autenticare runs a 2-week audit: builds the initial gold set, configures 4 metrics, installs a drift dashboard. We deliver the loop running, not just a report.
