Evaluating AI agents in production: how to measure quality without fooling yourself
Without continuous evaluation, AI agents degrade and no one notices — until the customer complains. The evaluation framework we use in all Gemini Enterprise projects: gold set, metrics, monitoring and drift.
Fabiano Brito
CEO & Founder
In traditional software, "passed tests" is binary. In AI, it's a distribution: 92% of cases pass, 8% degrade, and today's degraded case can become tomorrow's catastrophic one. That's why evaluation is not a phase — it's a continuous loop.
4 evaluation dimensions
Faithfulness
Is the response grounded in the retrieved data? This is the anti-hallucination metric.
How to measure: LLM-as-judge (Gemini Pro evaluating response vs context) + weekly human sampling.
Relevance
Does the response address the question? Penalizes correct but off-topic answers.
How to measure: embedding similarity + LLM-as-judge.
Completeness
Does it cover all relevant aspects? Critical for multi-part questions.
How to measure: human rubric or LLM-as-judge with explicit criteria.
Safety
Does it avoid prohibited content (PII leakage, bias, inappropriate recommendations)?
How to measure: specific classifier + rule-based + human sampling.
Ignoring one dimension = surprise in production. Covering all 4 = a solid foundation.
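The LLM-as-judge measurement used for faithfulness above can be sketched as a rubric prompt plus strict parsing of the verdict. This is an illustrative sketch, not the actual Vertex AI API: the rubric wording and the JSON verdict format are assumptions, and the actual judge call is left out.

```python
import json

# Hypothetical rubric: the 1-5 anchors and JSON output format are
# assumptions for illustration, not a prescribed standard.
FAITHFULNESS_RUBRIC = """\
Score the RESPONSE only on grounding in the CONTEXT.
1 = contradicts or invents facts absent from the context
3 = mostly grounded, with minor unsupported details
5 = every claim is supported by a passage in the context
Return JSON: {"score": <1-5>, "evidence": "<quoted passage>"}"""

def build_judge_prompt(question: str, context: str, response: str) -> str:
    """Assemble an explicit-rubric prompt for the LLM judge."""
    return (f"{FAITHFULNESS_RUBRIC}\n\n"
            f"QUESTION: {question}\nCONTEXT: {context}\nRESPONSE: {response}")

def parse_judge_verdict(raw: str) -> dict:
    """Parse the judge's JSON verdict; reject scores outside the rubric."""
    verdict = json.loads(raw)
    if not 1 <= verdict["score"] <= 5:
        raise ValueError("score outside rubric range")
    return verdict
```

Forcing the judge to quote the evidence passage is what makes the weekly human sampling cheap: the reviewer checks the quote, not the whole context.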
Gold set: the most underrated asset
A gold set is the collection of (question, expected answer) pairs curated by humans. It's what separates real evaluation from "guesswork".
How to build one
- Minimum size: 50–100 cases for pilot, 300–500 for production, 1,000+ for critical systems.
- Diversity: cover all main intents, known edge cases, ambiguities.
- Expected answer goes beyond content: desired format, citations, tone.
- Multi-reviewer annotation: two humans per case, with a third arbitrating disagreements.
- Versioning: each version has a hash + date + owner.
How to maintain it
- Every real regression in production becomes a new case in the gold set.
- Every business rule change triggers a review of affected cases.
- Quarterly review: remove obsolete cases, add new scenarios.
Without a gold set, you're iterating prompts while looking at 3 examples in Slack. It's the difference between engineering and guessing.
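The versioning rule above (hash + date + owner per version) can be made concrete with a content hash over the serialized cases. A minimal sketch; the `GoldCase` field names are an assumed schema, not a prescribed format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field

@dataclass
class GoldCase:
    # Assumed schema: question, expected answer, plus the format/citation
    # expectations and reviewer trail the article recommends.
    question: str
    expected_answer: str
    intent: str
    expected_citations: list = field(default_factory=list)
    reviewers: list = field(default_factory=list)

def gold_set_version(cases: list) -> str:
    """Deterministic 12-char hash over the serialized gold set:
    editing any field of any case yields a new version identifier."""
    payload = json.dumps([asdict(c) for c in cases], sort_keys=True,
                         ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```

Because the hash is deterministic, two runs against the same gold set are directly comparable, and a silent edit to a case can never masquerade as a model improvement.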
When to evaluate
| Moment | What to run |
|---|---|
| Prompt change | Full automated gold set |
| Model upgrade (2.5 → 3) | Gold set + human review of 100 cases |
| RAG change (chunking, reranker) | Retrieval-focused gold set |
| Daily in production | Sample of 50–100 conversations |
| Weekly | Drift analysis, top categories with decline |
| Monthly | Deep human review of 200 cases |
| Quarterly | Bias audit, fairness by segment |
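The "full automated gold set on every prompt change" row is the one that belongs in CI. A hedged sketch of that gate, where `agent` and `score_fn` are stand-ins for the system under test and the judge metric (a 1–5 scale is assumed):

```python
def run_gold_set(agent, cases, score_fn, pass_score=4, min_pass_rate=0.90):
    """Score the agent's answer for every gold case and decide whether
    the change may be promoted. `agent` maps question -> answer;
    `score_fn` maps (expected, actual) -> 1-5 (both hypothetical)."""
    scores = [score_fn(case.expected_answer, agent(case.question))
              for case in cases]
    pass_rate = sum(s >= pass_score for s in scores) / len(scores)
    return {"pass_rate": pass_rate, "promote": pass_rate >= min_pass_rate}
```

The 90% threshold is illustrative; the point is that the number exists before the change does, so "looks fine to me" never decides a promotion.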
LLM-as-judge: using it well
LLM evaluating LLM has known biases (favors longer responses, same-model outputs, assertive tone). To use it well:
- Different model from the one being tested whenever possible.
- Explicit rubric: numbered criteria, not "is it good?".
- Calibration against human: for every 200 cases, 20 with parallel human review. Agreement < 80% → revise the rubric.
- Multiple rounds: 3 evaluations with different seeds, aggregation by median.
- Cite the passage: the judge needs to explain why it scored that way — makes auditing easier.
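The multiple-rounds rule above is a one-liner in practice. A sketch, where `judge_fn` is a stand-in for whatever call returns a single numeric score from the judge model:

```python
from statistics import median

def judge_with_rounds(judge_fn, prompt, seeds=(11, 22, 33)):
    """Run the judge once per seed and aggregate by median, which
    dampens a single outlier round. Seed values are arbitrary."""
    return median(judge_fn(prompt, seed=s) for s in seeds)
```

Median rather than mean is deliberate: one inflated or deflated round cannot move the aggregate.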
Drift: the silent killer
Drift is the gradual shift in user questions, underlying data, or model behavior that erodes quality without any single visible failure. How to detect it:
- Monitor embedding distribution of questions: if the cluster shifts, alert.
- Distribution of tool calls: if a previously used tool disappears, investigate.
- Latency by intent: sudden increase = behavior change.
- Fallback rate (agent says "I don't know"): if it rises, RAG is losing coverage.
- Human channel complaints: lagging indicator but reliable.
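The first detector in the list, monitoring the embedding distribution of questions, can be sketched as a centroid comparison. A minimal version with plain lists standing in for real embedding vectors; the 0.95 similarity threshold is an assumption to tune per project:

```python
import math

def _centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def embedding_drift(baseline, recent, min_similarity=0.95):
    """Compare the centroid of recent question embeddings with the
    baseline centroid; low cosine similarity suggests the question
    cluster has shifted and the gold set no longer matches traffic."""
    sim = _cosine(_centroid(baseline), _centroid(recent))
    return {"centroid_similarity": sim, "alert": sim < min_similarity}
```

A centroid shift does not say *what* changed, only that the gold set and production traffic are diverging; the weekly human deep dive answers the what.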
Product metrics (not just model metrics)
Evaluating an agent without product metrics means optimizing for the wrong thing:
- Autonomous resolution rate: % of conversations ended without a human.
- Post-conversation CSAT: single question at the end.
- Resolution time: compared to the human baseline.
- Return rate: user came back within 24h with the same question = initial answer wasn't enough.
- Cost per conversation: tokens × price + tools.
- Conversion (in commercial use): requested a quote, scheduled, purchased.
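Two of these metrics, autonomous resolution and the 24h return rate, fall straight out of the conversation log. A sketch assuming each record carries `user_id`, `question`, `timestamp` (a datetime), and an `escalated` flag; the field names are illustrative:

```python
from datetime import timedelta

def product_metrics(conversations):
    """Compute autonomous resolution rate and 24h return rate from
    raw conversation records (assumed dict schema, see lead-in)."""
    n = len(conversations)
    autonomous = sum(not c["escalated"] for c in conversations) / n
    # A "return": same user asks the same question within 24h of an
    # earlier conversation -> the first answer wasn't enough.
    returns, last_seen = 0, {}
    for c in sorted(conversations, key=lambda c: c["timestamp"]):
        key = (c["user_id"], c["question"])
        prev = last_seen.get(key)
        if prev is not None and c["timestamp"] - prev <= timedelta(hours=24):
            returns += 1
        last_seen[key] = c["timestamp"]
    return {"autonomous_resolution_rate": autonomous,
            "return_rate_24h": returns / n}
```

In the stack described below this would run as SQL over BigQuery; the Python version just makes the definitions unambiguous.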
Stack in Autenticare projects
- Vertex AI Evaluation: native, plugged directly into the Gemini Enterprise agent.
- BigQuery: stores conversations, scores, metadata. Ad-hoc SQL.
- Looker: quality and drift dashboards.
- Cloud Run jobs: run the gold set daily, alert on regressions.
- PagerDuty: human alert when a key metric drops below threshold.
- Weekly notebook: consultant does a deep dive on 50 real conversations and produces a report.
Minimum checklist before going to production
- Gold set covering the diversity of intents, edge cases, and ambiguities.
- Full gold set run before any promotion to production.
- Minimum thresholds for faithfulness, relevance, completeness, and safety; any score below minimum blocks deploy.
- Regression > 5% on any metric triggers PagerDuty.
- Daily review of 50–100 sampled conversations, and a named person with authority to stop the agent.
If any item is missing, the agent is not ready.
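The regression trigger in the checklist is mechanical. A sketch of the comparison step; metric names and the 5% threshold follow the checklist, while the paging call itself is left out:

```python
def regressions(baseline: dict, current: dict, max_drop: float = 0.05):
    """Return every metric that dropped more than max_drop versus the
    baseline run; a non-empty result is what should page the on-call."""
    return {m: round(baseline[m] - current[m], 4)
            for m in baseline
            if m in current and baseline[m] - current[m] > max_drop}
```

Returning the offending metrics with their deltas, rather than a bare boolean, means the page already tells the on-call where to look.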
Is your agent already in production without formal evaluation?
Autenticare runs a 2-week audit: builds the initial gold set, configures 4 metrics, installs a drift dashboard. We deliver the loop running, not just a report.
