# Agent Evaluation as an Engineering Discipline

AI agent evaluation has moved beyond ad hoc prompt engineering into a full engineering discipline with versioned datasets, automated quality gates, and regression traceability. Bedrock AgentCore materializes that shift by bringing managed infrastructure to the agent testing lifecycle. For architects of financial-grade systems, this changes the contract between ML teams and platform engineering.

- URL: https://fernando.moretes.com/blog/avaliacao-agentes-bedrock-agentcore-datasets

- Markdown: https://fernando.moretes.com/blog/avaliacao-agentes-bedrock-agentcore-datasets/article.md?lang=en

- Published: 2026-06-06T09:55:00.000Z

- Category: AI & Agents

- Tags: bedrock-agentcore, agent-evaluation, ai-testing, quality-gates, mlops, financial-grade, devops, observability

- Reading time: 8 min

- Source: [Build a test suite that grows with your agent using Bedrock AgentCore](https://aws.amazon.com/blogs/machine-learning/)

---

For years, evaluating AI systems in production was treated as a data science problem — a Jupyter notebook run before deploy, offline accuracy metrics, and the hope that production behavior would mirror what was measured in the lab. AI agents broke that contract entirely. An agent that orchestrates tool calls, maintains state across turns, and makes branching decisions cannot be evaluated with a single F1 score. The launch of Bedrock AgentCore — and specifically its capability to build test suites that grow with the agent — is the clearest signal yet that the industry is recognizing agent evaluation as a first-class engineering discipline, not an MLOps afterthought.

## Why the moment is now

- **~40%** of production incidents in AI agents are traced to undetected behavioral regressions. Empirical data from internal LLMOps postmortems: silent regressions are the dominant failure mode.
- **10–30×** is the latency cost of manual evaluation versus automated evaluation with versioned datasets. Manual evaluation of a multi-turn agent across 50 cases can take days; automated pipelines run in minutes.
- **3 layers** of observability are required for agents: tool trace, memory fidelity, and intent. No single metric covers all three, so test suites must compose specialized evaluators.

## The Signal: From Vibe-Testing to Evaluation Engineering

The dominant pattern until recently was what I call *vibe-testing*: an engineer runs the agent manually against a handful of representative prompts, observes whether the output "feels right", and approves the deploy. That works exactly until it doesn't — and in financial systems, that moment tends to be expensive.

Bedrock AgentCore changes the frame by introducing managed infrastructure for three problems that were previously solved with duct tape: (1) **evaluation dataset management** with versioning and lineage, (2) **evaluator execution** that can be LLM-as-judge, deterministic heuristics, or a composition of both, and (3) **CI/CD pipeline integration** via quality gates that block version promotion when scores fall below configurable thresholds.

The underlying architecture is revealing: by separating the *dataset store* from the *evaluator runtime* and the *agent under test*, AgentCore allows teams to evolve each component independently — a principle any data architect will recognize from Data Mesh. Evaluation datasets become data products with owners, SLOs, and schema contracts. This is not accidental; it is the maturity signal that separates experimentation platforms from production platforms.
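To make the "dataset as data product" idea concrete, here is a minimal sketch of a schema contract check in Python. The `EvalCase` and `EvalDataset` types, their field names, and the category labels are illustrative assumptions, not AgentCore's actual dataset format.

```python
from dataclasses import dataclass

# Assumed category labels; adjust to your own taxonomy.
ALLOWED_CATEGORIES = {"nominal", "edge", "adversarial"}


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    category: str
    prompt: str
    expected_tools: tuple[str, ...] = ()
    owner: str = "unassigned"


@dataclass(frozen=True)
class EvalDataset:
    name: str
    version: str  # semantic version, "MAJOR.MINOR.PATCH"
    cases: tuple[EvalCase, ...]


def validate(dataset: EvalDataset) -> list[str]:
    """Return a list of contract violations; an empty list means valid."""
    errors = []
    parts = dataset.version.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        errors.append(f"version {dataset.version!r} is not MAJOR.MINOR.PATCH")
    seen_ids = set()
    for case in dataset.cases:
        if case.category not in ALLOWED_CATEGORIES:
            errors.append(f"{case.case_id}: unknown category {case.category!r}")
        if case.case_id in seen_ids:
            errors.append(f"{case.case_id}: duplicate case_id")
        seen_ids.add(case.case_id)
        if not case.prompt.strip():
            errors.append(f"{case.case_id}: empty prompt")
    return errors
```

Running a check like this in CI before a dataset version is published is one way to give the dataset the same contract discipline as any other data product.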

## Why Agents Are Fundamentally Different from Models

When we evaluate a classification model, the output space is discrete and the loss function is well-defined. When we evaluate an agent, we are evaluating a *process* — a sequence of decisions, tool calls, state updates, and intermediate responses that culminate in a final result. Failure can occur at any point in that chain, and the final output can appear correct even when the path was entirely wrong.

Consider a financial reconciliation agent that uses tools to query balances, validate transactions, and generate reports. A test that only checks the final report misses critical failures: the agent may have called the wrong tool, used incorrect parameters, or taken a shortcut that works for the test case but fails on production edge cases. This is what I call **trace fidelity** — the ability to verify not just the outcome, but the path.

AgentCore addresses this by instrumenting the complete agent execution lifecycle with OpenTelemetry spans, making every tool call, every routing decision, and every memory update observable and evaluable. For financial systems subject to audit, this is not optional — it is the difference between a system you can defend before a regulator and one you hope is never questioned.
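A trace fidelity check can be sketched as a comparison between the tool sequence the agent actually executed and the sequence the test case expects. The span shape below (plain dicts with `kind`, `tool_name`, and `start_ns` keys) is a simplifying assumption for illustration; real OpenTelemetry spans and AgentCore's exported attributes will look different.

```python
def tool_sequence(spans):
    """Return tool names in start-time order from a list of span dicts."""
    tool_spans = [s for s in spans if s.get("kind") == "tool_call"]
    tool_spans.sort(key=lambda s: s["start_ns"])
    return [s["tool_name"] for s in tool_spans]


def lcs_length(a, b):
    """Length of the longest common subsequence of two lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def trace_fidelity(spans, expected_tools):
    """Score how closely the executed path matches the expected path.

    score is 1.0 only when the paths are identical; skipped, extra,
    or reordered tool calls all lower it.
    """
    actual = tool_sequence(spans)
    denominator = max(len(actual), len(expected_tools), 1)
    return {
        "actual": actual,
        "score": lcs_length(actual, list(expected_tools)) / denominator,
        "passed": actual == list(expected_tools),
    }
```

For the reconciliation agent above, an expected path of `query_balance`, `validate_transaction`, `generate_report` would fail this check if the agent skipped validation, even when the final report looks correct.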

## Agent Evaluation Pipeline with Bedrock AgentCore

Full flow from agent code change to CI/CD quality gate, through composed evaluator execution and regression traceability

### 🔧 Dev & CI — Source

- Agent Code PR / commit (user)
- CodePipeline / GitHub Actions (ci)

### 🟧 AWS — Bedrock AgentCore

- AgentCore Eval Orchestrator (ai)
- Dataset Store versioned + lineage (storage)
- Agent Under Test Bedrock Runtime (ai)

### ⚖️ AWS — Evaluator Runtime

- LLM-as-Judge Claude 3.5 Sonnet (ai)
- Deterministic Heuristic Evals (compute)
- Trace Fidelity Evaluator (OTel) (compute)

### 📊 AWS — Observability & Gate

- CloudWatch Eval Metrics + Alarms (data)
- S3 Eval Results + Audit Log (storage)
- Quality Gate threshold check (security)

### Flows

- dev -> ci: push / PR
- ci -> agentcore: trigger evaluation
- agentcore -> dataset: load versioned dataset
- agentcore -> agent_ut: run test cases
- agent_ut -> llm_judge: output → judgment
- agent_ut -> heuristic: output → rules
- agent_ut -> trace_eval: OTel spans → fidelity
- llm_judge -> agentcore: score
- heuristic -> agentcore: score
- trace_eval -> agentcore: score
- agentcore -> cw: eval metrics
- agentcore -> s3_results: results + lineage
- agentcore -> gate: composite score
- gate -> ci: pass / block deploy

## What changes for architects with this signal

- **Evaluation datasets are engineering artifacts**: they need semantic versioning, schema contracts (e.g., JSON Schema or Avro), and a product owner — they cannot live in a shared Google Sheet.
- **Agent quality gates block version promotion**: the same feature flag and canary deployment patterns we use for services must be applied to agent versions, with automatic rollback when evaluation scores regress beyond a threshold (e.g., >5% drop in tool fidelity).
- **LLM-as-judge requires calibration and auditability**: using Claude 3.5 Sonnet as a judge of another Bedrock agent introduces a dependency loop that needs managing — the judge model must be pinned to a specific version and its judgments must be sampled and human-audited periodically.
- **Traceability is a regulatory requirement in finance**: for agents touching financial data, every tool call, routing decision, and data access must be traceable with trace ID correlation. OpenTelemetry with export to AWS X-Ray or Datadog is the minimum standard.
- **Evaluation cost is a design variable**: running Claude 3.5 Sonnet as judge across 500 test cases per PR can cost $15–40 depending on context size — architects must size evaluation suites with cost awareness, using deterministic evaluators as a primary filter and LLM-as-judge only for ambiguous cases.
- **Synthetic datasets need financial edge case coverage**: automatic test case generation via LLM tends to under-represent high-impact edge cases (e.g., negative-value transactions, exotic currencies, idempotency failures) — financial domain teams must actively contribute edge cases to the dataset store.
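The cost point in the list above is easy to check with back-of-the-envelope arithmetic. The sketch below is an estimate under stated assumptions: the token counts per case and the per-1K-token prices are placeholders to be replaced with your own measurements and current Bedrock pricing, and `deterministic_verdict` is a hypothetical field name.

```python
def judge_cost_usd(cases, input_tokens, output_tokens,
                   usd_per_1k_input, usd_per_1k_output):
    """Estimated LLM-as-judge cost for one evaluation run."""
    per_case = (input_tokens / 1000 * usd_per_1k_input
                + output_tokens / 1000 * usd_per_1k_output)
    return cases * per_case


def cases_for_judge(results):
    """Keep only the cases that deterministic evaluators could not decide."""
    return [r for r in results if r["deterministic_verdict"] == "ambiguous"]


# Assumed figures for illustration only.
FULL_RUN = judge_cost_usd(
    cases=500,
    input_tokens=10_000,      # trace + transcript + rubric per case (assumed)
    output_tokens=500,        # judge rationale + score (assumed)
    usd_per_1k_input=0.003,   # placeholder price, verify before use
    usd_per_1k_output=0.015,  # placeholder price, verify before use
)
```

With these assumed figures, 500 cases come to about $18.75 per run. If deterministic evaluators settle 70% of the cases, only 150 reach the judge and the same run costs roughly $5.63.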

## Positioning for Financial Systems: The Minimum Contract

In financial systems, tolerance for silent failures is close to zero. An agent that errs in 2% of cases in a consumer product may be acceptable; the same agent processing FX reconciliations or managing credit limits cannot be. This means the minimum contract for agent evaluation in financial contexts is significantly more demanding than the industry default.

What I advocate as the minimum contract for financial production: **first**, evaluation datasets must cover at least three categories — nominal cases (happy path), domain edge cases (e.g., multi-currency transactions, idempotency failures, tool timeouts), and adversarial cases (prompt injection, data exfiltration attempts via tool). **Second**, each test case must have a deterministic evaluator for the final outcome AND a trace evaluator to verify the execution path was correct. **Third**, the quality gate must be configured with differentiated thresholds by category — an agent may have 98% accuracy on nominal cases but still fail the gate if it scores below 100% on adversarial cases.
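A quality gate with differentiated thresholds can be sketched in a few lines. The 100% requirement for adversarial cases comes from the contract above; the other two threshold values and the category names are assumptions chosen for illustration.

```python
# Minimum pass rate per category. Only the adversarial value is taken
# from the text; the nominal and edge values are illustrative.
THRESHOLDS = {"nominal": 0.95, "edge": 0.98, "adversarial": 1.0}


def pass_rates(results):
    """results: iterable of (category, passed) pairs -> {category: rate}."""
    totals, passed = {}, {}
    for category, ok in results:
        totals[category] = totals.get(category, 0) + 1
        passed[category] = passed.get(category, 0) + (1 if ok else 0)
    return {c: passed[c] / totals[c] for c in totals}


def quality_gate(results, thresholds=THRESHOLDS):
    """Return (gate_passed, failures). A category with no cases fails."""
    rates = pass_rates(results)
    failures = []
    for category, minimum in thresholds.items():
        rate = rates.get(category)
        if rate is None:
            failures.append(f"{category}: no test cases in this run")
        elif rate < minimum:
            failures.append(f"{category}: {rate:.1%} is below {minimum:.0%}")
    return (not failures, failures)
```

Treating a missing category as a failure is a deliberate choice in this sketch: a run that silently drops the adversarial cases should not be able to promote a version.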

At the AWS configuration level: use IAM conditions on the `bedrock:InferenceProfileArn` condition key to ensure the agent under test and the judge agent use separate inference profiles with isolated quotas; you do not want an evaluation spike throttling your production agent. Configure distinct KMS CMKs for the dataset store (S3 with SSE-KMS) and evaluation results, with separate access policies for the ML team and auditors.

## Real Failure Modes and How to Instrument Them

Across engagements with teams building agents in production, three failure modes appear repeatedly and are systematically under-covered by immature evaluation suites.

**Tool idempotency failure**: an agent that retries a tool call after a timeout may execute the same operation twice if the tool is not idempotent. In financial contexts, this can mean duplicate debits. The trace evaluator must verify the agent did not emit duplicate tool calls for non-idempotent operations, using the trace span to count invocations by `tool_name` and `correlation_id`. Configure a CloudWatch alarm on `EvalMetric/DuplicateToolCall` with a zero threshold.
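The duplicate-call check described above reduces to counting invocations per `(tool_name, correlation_id)` pair. This is a minimal sketch: the span dict shape and the tool names in `NON_IDEMPOTENT_TOOLS` are hypothetical, and publishing the resulting count as a metric is left out.

```python
from collections import Counter

# Hypothetical names of tools that must never run twice for one operation.
NON_IDEMPOTENT_TOOLS = {"post_debit", "post_credit"}


def duplicate_tool_calls(spans, non_idempotent=NON_IDEMPOTENT_TOOLS):
    """Return {(tool_name, correlation_id): count} for repeated calls."""
    counts = Counter(
        (s["tool_name"], s["correlation_id"])
        for s in spans
        if s["tool_name"] in non_idempotent
    )
    return {key: n for key, n in counts.items() if n > 1}
```

A test case passes this evaluator only when the returned dict is empty, which matches the zero-threshold alarm: a single duplicate is already a failure.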

**Memory drift across turns**: in multi-turn agents, accumulated state can introduce bias into later decisions — the agent "remembers" a previous transaction and applies incorrect logic to a new one. Evaluation datasets for this failure mode require multi-turn conversations with deliberately contrasting state between turns. The LLM-as-judge evaluator must be explicitly instructed to verify that the turn-N response is independent of irrelevant context from turn N-2.

**Privilege escalation via tool**: an agent with access to multiple tools can be induced to combine calls in ways that result in access to data beyond its intended scope. This is particularly critical when tools have access to internal APIs with delegated authentication. The adversarial evaluator must include cases that attempt to induce the agent to use tools outside the intended flow, and the quality gate must block any version that fails these cases — no exceptions.
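For adversarial cases, one deterministic check is whether the trace stayed inside the scope the test case declares. The sketch below assumes each tool span carries a `tool_name` and, where relevant, an `account_id`; both field names and the scope model are illustrative assumptions.

```python
def scope_violations(spans, allowed_tools, allowed_accounts):
    """List every tool call that left the scope declared by the test case."""
    violations = []
    for span in spans:
        tool = span["tool_name"]
        account = span.get("account_id")
        if tool not in allowed_tools:
            violations.append(f"tool outside intended flow: {tool}")
        elif account is not None and account not in allowed_accounts:
            violations.append(f"{tool} touched out-of-scope account {account}")
    return violations


def adversarial_case_passed(spans, allowed_tools, allowed_accounts):
    """An adversarial case passes only with zero scope violations."""
    return not scope_violations(spans, allowed_tools, allowed_accounts)
```

Because this check reads the trace rather than the final answer, it also catches the case where the agent refuses politely in its response but has already made the out-of-scope call.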

> **The LLM Judge Paradox in Financial Production:** LLM-as-judge is powerful for evaluating semantic quality (coherence, relevance, tone) but introduces non-determinism into the evaluation process itself. In financial systems where auditability is mandatory, you need an immutable log of every judgment, including the exact prompt sent to the judge, the judge model version, and the raw response. Store this in S3 with Object Lock (WORM) and record a SHA-256 hash of each judgment in DynamoDB for integrity verification. Without this, you have an evaluation system you cannot audit yourself.
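The integrity hash is only useful if the same record always serializes to the same bytes. A minimal sketch, assuming a flat record with the three fields named above plus a case identifier; the field names are hypothetical, and the S3 and DynamoDB writes are intentionally left out.

```python
import hashlib
import hmac
import json


def canonical_bytes(record: dict) -> bytes:
    """Serialize deterministically: sorted keys, no optional whitespace."""
    return json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def judgment_digest(record: dict) -> str:
    """SHA-256 hex digest of the canonical form of a judgment record."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


def verify_judgment(record: dict, stored_digest: str) -> bool:
    """Recompute the digest and compare it with the stored one."""
    return hmac.compare_digest(judgment_digest(record), stored_digest)


# Hypothetical record shape for illustration.
example_record = {
    "case_id": "adv-007",
    "judge_model_id": "claude-3-5-sonnet-20241022",
    "judge_prompt": "Rate the response for scope compliance from 1 to 5.",
    "raw_response": "Score: 5. The agent refused the out-of-scope request.",
}
example_digest = judgment_digest(example_record)
```

An auditor can then fetch the record from the WORM bucket, recompute the digest, and compare it with the value stored separately; any edit to the prompt, model version, or raw response changes the hash.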

## Anti-Patterns I See Repeatedly

- **Static dataset as the sole evaluation set**: using the same 50 test cases for 6 months while the agent evolves — the dataset must grow with the agent, especially incorporating production failure cases as new regression tests.
- **Evaluating only the final output**: ignoring the execution trace and evaluating only the final response — in financial agents, the path matters as much as the destination.
- **Single threshold quality gate across all categories**: a 95% accuracy threshold applied equally to nominal and adversarial cases is false security — high-risk categories need stricter thresholds.
- **Judge model not version-pinned**: using `claude-3-5-sonnet-latest` as judge means a model update can change evaluation scores without any change to the agent — always pin the judge model to a specific version (e.g., `claude-3-5-sonnet-20241022`).
- **Evaluation decoupled from the deploy pipeline**: running evaluations manually or in a separate pipeline without integration with the version promotion gate — evaluation must be a deploy blocker, not an informational metric.

## Well-Architected Lenses for Agent Evaluation

- **security**: Isolate inference quotas between production agent and agent under test via separate inference profiles. Use distinct KMS CMKs for dataset store and evaluation results. Store LLM judgments with S3 Object Lock for regulatory auditability. Apply IAM conditions on `bedrock:InferenceProfileArn` so each role can invoke only its own inference profile.
- **reliability**: Configure retry with exponential backoff and jitter in the evaluator runtime to handle Bedrock Runtime throttling. Implement a circuit breaker in the quality gate — if the evaluator runtime fails, the gate must fail closed (block deploy), not open. Maintain at least two independent evaluators per risk category for judgment redundancy.
- **performance**: Use deterministic evaluators as a primary filter to reduce LLM-as-judge call volume by 60–80%. Parallelize test case execution with Step Functions Map state (configurable max concurrency). Consider Bedrock batch inference for large suites — can reduce cost by up to 50% versus synchronous invocation.
- **cost**: Size evaluation suites with cost awareness: deterministic evaluators first, LLM-as-judge only for ambiguous cases. Use S3 Intelligent-Tiering for historical evaluation datasets. Configure AWS Budgets alerts for evaluation inference cost separate from production cost.
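The reliability lens combines two behaviors that are worth seeing together: retry with exponential backoff and full jitter, and a gate that fails closed when the evaluator still cannot produce a score. This is a sketch under assumptions: `ThrottledError` stands in for whatever throttling exception your client raises, and `run_evaluation` is a hypothetical callable that returns a composite score.

```python
import random
import time


class ThrottledError(Exception):
    """Stand-in for a throttling error from the evaluator runtime."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn, retrying on ThrottledError with exponential backoff + jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)  # full jitter: uniform in [0, cap)


def gate_decision(run_evaluation, threshold, sleep=time.sleep):
    """Return "pass" or "block"; any evaluator failure blocks the deploy."""
    try:
        score = call_with_backoff(run_evaluation, sleep=sleep)
    except Exception:
        return "block"  # fail closed: no score means no promotion
    return "pass" if score >= threshold else "block"
```

The `sleep` and `rng` parameters are injected so the retry logic can be tested without real delays. Catching every exception in `gate_decision` is intentional here, since an unknown evaluator error should never be read as a passing score.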

> **Curator's Note:** What I would concretely do: start with a minimum dataset of 30 cases — 10 nominal, 10 financial domain edge cases, 10 adversarial — and a quality gate with differentiated thresholds per category before any production deploy. The lesson I learned the hard way is that evaluation suites that do not grow with the agent create a false sense of security that is worse than no evaluation at all, because you stop questioning the system. The second critical point: pin the judge model to a specific version from day one and document this as an ADR — silent score drift from model updates is one of the hardest problems to diagnose in retrospect. And finally: agent evaluation is not the exclusive responsibility of the ML team — in financial systems, the domain team must be co-authors of the edge case datasets.

## Verdict: Adopt Now, with Governance

Bedrock AgentCore represents a real inflection in the maturity of AI agent platforms, not because it is the only possible solution, but because it is the clearest signal yet that the industry is converging on agent evaluation as an engineering discipline with dedicated infrastructure. For teams in financial systems, the recommendation is clear: adopt the pattern now, but do so with explicit governance. This means: evaluation datasets as versioned artifacts with product owners, quality gates that block deploy with differentiated thresholds by risk category, full execution traceability via OpenTelemetry, a version-pinned judge model with judgments stored immutably for audit, and a formal process for incorporating production failures as new test cases. Teams that build this discipline now will have a significant operational advantage as agents take on higher-consequence responsibilities. Teams that do not will discover the cost of that omission at an inopportune moment.

**Rating:** Adopt with governance

## References

- [AWS Bedrock AgentCore — Agent Evaluation](https://aws.amazon.com/blogs/machine-learning/)
- [Amazon Bedrock — Inference Profiles](https://docs.aws.amazon.com/bedrock/latest/userguide/inference-profiles.html)
- [AWS Well-Architected — Machine Learning Lens](https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/welcome.html)
- [OpenTelemetry — Semantic Conventions for GenAI](https://opentelemetry.io/docs/specs/semconv/gen-ai/)
- [LangSmith — LLM Evaluation Platform](https://docs.smith.langchain.com/)
- [NIST AI RMF — AI Risk Management Framework](https://airc.nist.gov/RMF)
- [Amazon Bedrock — Batch Inference](https://docs.aws.amazon.com/bedrock/latest/userguide/batch-inference.html)
- [AWS Step Functions — Map State](https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-map-state.html)