Amazon Bedrock AgentCore: Continuous Agent Optimization in Production
Amazon Bedrock AgentCore introduces a continuous improvement loop that turns production traces into actionable diagnostics, data-grounded recommendations, and statistical validation via A/B testing. For architects of financial systems and high-stakes platforms, this represents AWS's first serious attempt to close the gap between agent observability and reliable production operation.
The most dangerous problem in agentic systems is not the agent that fails with a visible stack trace — it is the agent that responds confidently, looks fine on dashboards, and silently delivers wrong answers to hundreds of users for weeks. Amazon Bedrock AgentCore, with its new continuous optimization capabilities announced in June 2026, attacks that blind spot directly. As an architect who has spent years designing financial systems where silently incorrect behavior can result in regulatory losses or customer harm, I look at this feature with productive technical skepticism — and genuine interest.
What the AgentCore Continuous Optimization Loop Actually Is
AgentCore is not simply a more sophisticated logging layer. It is an attempt to close the complete MLOps cycle for agents: observe → diagnose → recommend → validate → promote. Each step has a concrete technical counterpart.
Observe starts with collecting production traces at scale — not trace-by-trace, but aggregated analysis of hundreds of sessions simultaneously. Failure Insights identify recurring failure patterns, including so-called silent behavioral failures: cases where the agent apparently completes the task, but the result is wrong, incomplete, or outside the expected scope. Intent Insights cluster requests by user intent, and Trajectory Insights group the paths the agent takes — revealing trajectory deviations that no p99 latency dashboard would catch.
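To make the idea of trajectory grouping concrete, here is a minimal sketch that groups sessions by the ordered sequence of tools they called and flags rare paths. It is not how Trajectory Insights is implemented; the session shape, the tool names, and the 5% threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def trajectory_signature(tool_calls):
    """Collapse a session's ordered tool calls into a hashable path."""
    return tuple(tool_calls)


def rare_trajectories(sessions, max_share=0.05):
    """Return paths whose share of all sessions is at or below max_share.

    `sessions` maps a session ID to the ordered list of tool names it called
    (a hypothetical shape, not an AgentCore data model).
    """
    counts = Counter(trajectory_signature(calls) for calls in sessions.values())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {path: n for path, n in counts.items() if n / total <= max_share}


# 48 sessions follow the expected path; 2 skip the limits check without erroring.
sessions = {f"s{i}": ["lookup_account", "check_limits", "quote"] for i in range(48)}
sessions["s48"] = ["lookup_account", "quote"]
sessions["s49"] = ["lookup_account", "quote"]
outliers = rare_trajectories(sessions)
```

Neither outlier session raised an error, which is exactly why a per-trace view or a latency dashboard would not surface them.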
Diagnose and Recommend is where the system goes beyond the observable. Generated recommendations analyze traces and evaluation outputs to suggest specific changes to system prompts and tool descriptions. This is significantly different from a generic suggestion: each recommendation comes with a rationale traceable to observed production failures.
Validate happens in two stages: first batch evaluation against a team-defined dataset, with aggregate scores from multiple evaluators, and then confirmation via A/B testing, with live traffic split between versions and the difference checked for statistical significance. This is the point where reliability engineering meets agent operations.
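As an illustration of what statistical validation of an A/B test can mean, the sketch below runs a two-sided two-proportion z-test on pass rates from a control and a candidate version. This is a standard textbook test that I am using as an assumption; the source does not say which statistical method AgentCore applies, and the counts are made up.

```python
import math


def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Return (z, two-sided p-value) for the difference in pass rates B - A."""
    if total_a <= 0 or total_b <= 0:
        raise ValueError("both arms need at least one observation")
    p_a = success_a / total_a
    p_b = success_b / total_b
    # Pooled pass rate under the null hypothesis that both versions are equal.
    pooled = (success_a + success_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all, so no evidence of a difference
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothetical counts: sessions judged "pass" by the evaluators in each arm.
z, p_value = two_proportion_z_test(820, 1000, 860, 1000)
promote = p_value < 0.05 and z > 0
```

A pass/fail rate is only one possible metric. With graded evaluator scores, a test on means would be the analogous choice.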
AgentCore Continuous Optimization Loop
Full flow from agent execution in production through to validated promotion of a new version, passing through insights, recommendations, and statistical validation.
- AgentCore Runtime · managed execution
- AWS Lambda · custom agent logic
- Amazon EKS · containerized agents
- Production Traces · sessions at scale
- Failure Insights · silent + error patterns
- Intent Insights · user intent clusters
- Trajectory Insights · path grouping + outliers
- Recommendations · system prompt + tool desc
- Batch Evaluation · multi-evaluator scoring
- A/B Testing · statistical traffic split
- New Agent Version · validated + promoted
Where AgentCore Genuinely Shines: The Silent Failure Problem
In financial systems, the concept of silent failure is not new: it is the nightmare of any SRE team. A service that returns HTTP 200 with a semantically incorrect payload is far more dangerous than one that returns 500, because nothing alerts on it. With AI agents, this problem is amplified: the agent may complete all tool calls, return a well-formatted response, and still have misinterpreted the user's intent, omitted a critical compliance step, or generated a financial recommendation outside the authorized scope.
What AgentCore's Failure Insights do differently is analyze behavioral patterns at scale — hundreds of sessions simultaneously — to identify where the agent systematically deviates from expected behavior, even without explicit errors. This is analogous to what we do with distributed tracing when hunting for latency anomalies in distribution tails: you do not find the problem looking at one trace at a time.
Ranking failures by impact (how many users are affected) is a smart product decision. In a financial environment with SLOs defined by customer segment, this maps directly to incident prioritization, provided the ranking is weighted by segment rather than by raw user count: a bug affecting 0.1% of high-value transactions can be more critical than one affecting 5% of low-risk queries, and the system needs to know that.
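The prioritization argument above can be sketched as a segment-weighted ranking. The segment names and weights here are hypothetical; in practice they would come from your own SLO definitions, not from AgentCore.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailurePattern:
    name: str
    affected_sessions: int
    segment: str


# Hypothetical business weights per segment (assumption, not an AgentCore setting).
SEGMENT_WEIGHTS = {"high_value_transaction": 100.0, "low_risk_query": 1.0}


def rank_by_weighted_impact(patterns, weights=SEGMENT_WEIGHTS):
    """Sort failure patterns by affected sessions multiplied by segment weight."""
    def score(pattern):
        return pattern.affected_sessions * weights.get(pattern.segment, 1.0)
    return sorted(patterns, key=score, reverse=True)


# Out of 10,000 sessions: 5% hit a low-risk bug, 0.1% hit a high-value bug.
patterns = [
    FailurePattern("wrong FAQ answer", 500, "low_risk_query"),
    FailurePattern("skipped limit check", 10, "high_value_transaction"),
]
ranked = rank_by_weighted_impact(patterns)
```

With these weights the high-value pattern scores 1,000 against 500 and ranks first, even though it touches fifty times fewer sessions.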
The Intent Insights capability also deserves attention: by clustering what users actually try to do versus what the system was designed to support, you get a continuous gap analysis of your agentic product-market fit. This is product observability, not just technical observability.
Agent A/B Testing: Applied Reliability Engineering
A/B testing of agents is conceptually more complex than A/B testing of traditional software features, and it is important to understand why. In a UI A/B test, you measure a discrete metric — click rate, conversion, time on page. In an agent, you are measuring the quality of a generated response, which is inherently subjective and multidimensional: factual accuracy, scope adherence, reasoning quality, policy compliance.
AgentCore resolves this with a multi-evaluator system that collectively defines what 'good' means for that specific agent. This is analogous to what we do with composite SLOs in financial systems: you do not have a single availability SLO, you have SLOs by transaction type, by customer segment, by channel. The composition of those indicators is what defines the real health of the system.
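A rough sketch of the composite idea: several evaluators each score one dimension, scores are averaged over a dataset, and a candidate passes only if every dimension clears its own floor. The evaluator functions, the floors, and the dataset shape are illustrative assumptions, not the AgentCore evaluator API.

```python
from statistics import mean


def contains_disclaimer(case, answer):
    """Policy compliance: 1.0 if the required disclaimer is present."""
    return 1.0 if "not financial advice" in answer.lower() else 0.0


def within_scope(case, answer):
    """Scope adherence: 0.0 if any forbidden term for this case appears."""
    lowered = answer.lower()
    return 0.0 if any(term in lowered for term in case["forbidden_terms"]) else 1.0


EVALUATORS = {"policy_compliance": contains_disclaimer, "scope_adherence": within_scope}
FLOORS = {"policy_compliance": 0.95, "scope_adherence": 1.0}  # hypothetical thresholds


def evaluate_batch(agent, dataset, evaluators=EVALUATORS):
    """Run the agent over every case and return the mean score per evaluator."""
    if not dataset:
        raise ValueError("evaluation dataset is empty")
    scores = {name: [] for name in evaluators}
    for case in dataset:
        answer = agent(case["prompt"])
        for name, evaluator in evaluators.items():
            scores[name].append(evaluator(case, answer))
    return {name: mean(values) for name, values in scores.items()}


def passes_gate(aggregate, floors=FLOORS):
    """A candidate passes only if every dimension meets its own floor."""
    return all(aggregate[name] >= floor for name, floor in floors.items())


dataset = [
    {"prompt": "Should I buy stock X?", "forbidden_terms": ["guaranteed return"]},
    {"prompt": "What is my card limit?", "forbidden_terms": ["guaranteed return"]},
]


def candidate_agent(prompt):
    return "General information only, not financial advice."


aggregate = evaluate_batch(candidate_agent, dataset)
approved = passes_gate(aggregate)
```

Gating on per-dimension floors rather than a single blended average mirrors the composite-SLO analogy: a strong score on one dimension cannot hide a compliance failure on another.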
Traffic splitting in production for agents raises engineering questions worth attention. Unlike a UI split, where user state is relatively isolated, an agent may maintain session context, access external tools with side effects, and operate in multi-step workflows. This means the A/B test design must consider: idempotency of tool calls, context isolation between versions, and the impact of side effects on downstream systems.
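One way to get context isolation between versions is sticky assignment: hash the session ID so that every step of a multi-step workflow lands on the same variant. The sketch below is a generic technique, assumed here for illustration; the source does not describe how AgentCore assigns traffic, and the experiment name and 10% share are made up.

```python
import hashlib


def assign_variant(session_id, experiment, candidate_share=0.10):
    """Deterministically map a session to 'control' or 'candidate'.

    The same session always gets the same variant, so session context and
    multi-step workflows never cross versions mid-conversation.
    """
    if not 0.0 <= candidate_share <= 1.0:
        raise ValueError("candidate_share must be between 0 and 1")
    digest = hashlib.sha256(f"{experiment}:{session_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "candidate" if bucket < candidate_share else "control"


variant = assign_variant("session-42", "prompt-v2-rollout")
```

Including the experiment name in the hash means a session that lands in the candidate group for one experiment is not systematically in the candidate group for the next.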
For financial environments, there is an additional layer: any change to an agent that makes credit decisions, generates investment recommendations, or processes transactions may have regulatory implications. A/B testing needs to be documented as part of the change management process, with complete traceability of which version made which decision for which user — something the AgentCore trace store should natively support.
Real Limits and Architectural Risks
1. Insights still in preview: Failure, intent and trajectory insights are in preview across 13 regions, not GA. For regulated financial workloads, preview means no SLA, no production support, and no API stability guarantees. Do not build compliance pipelines on top of preview features.
2. The optimization loop is only as good as your evaluators: Batch evaluation measures candidates against a team-defined dataset and criteria. If your evaluators do not cover financial compliance edge cases, the system will approve changes that look good in tests but fail in production under regulatory scenarios. Garbage in, garbage out, but now with statistical confidence.
3. A/B testing with side effects is dangerous without isolation: If your agent executes tool calls with real side effects (writes to DynamoDB, calls payment APIs, sends notifications), traffic splitting can create inconsistent state between versions. You need tool call idempotency keys and explicit context isolation before enabling A/B testing.
4. Automatic system prompt recommendations require human review in regulated environments: The rationale is derived from production data, but the proposed change is still generated by a model. In financial systems under BACEN, CVM or international equivalents, changes to prompts that affect credit decisions or investment recommendations need documented human approval. AgentCore does not replace that process.
5. Trace cost at scale: Storing and analyzing hundreds of sessions continuously has cost. Without a sampling strategy (e.g., tail-based sampling with OpenTelemetry), observability cost can exceed agent execution cost in high-volume workloads.
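To illustrate the tail-based sampling mentioned in the last point, here is a minimal sketch of the decision made once a trace is complete: keep every trace that looks interesting, and only a small deterministic share of the healthy ones. The trace fields and thresholds are assumptions; in a real deployment this logic would more likely live in the OpenTelemetry Collector's tail sampling processor than in application code.

```python
import hashlib


def keep_trace(trace, baseline_rate=0.05, slow_ms=5000, min_eval_score=0.7):
    """Decide, after the trace has finished, whether to store it.

    `trace` is a hypothetical dict with trace_id, and optionally error,
    duration_ms and eval_score.
    """
    # Always keep traces that errored or were slow.
    if trace.get("error") or trace.get("duration_ms", 0) >= slow_ms:
        return True
    # Always keep traces an evaluator scored poorly: candidate silent failures.
    if trace.get("eval_score", 1.0) < min_eval_score:
        return True
    # Otherwise keep a deterministic baseline share of healthy traces.
    digest = hashlib.sha256(trace["trace_id"].encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < baseline_rate


kept_error = keep_trace({"trace_id": "a1", "error": True})
kept_low_score = keep_trace({"trace_id": "a2", "eval_score": 0.4})
```

Keeping low-scoring traces unconditionally matters for this use case: sampling purely on errors and latency would discard exactly the silent failures the insights are meant to find.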
Integration with Existing Financial Architectures: What Actually Matters
The decision to make AgentCore runtime-agnostic is strategically correct and architecturally important. Most financial organizations experimenting with AI agents will not migrate all execution logic to a new managed runtime — they have agents running on Lambda with legacy business logic, on EKS with custom orchestrators, or in hybrid environments with data sovereignty constraints.
This means AgentCore's value layer is observability and optimization, not execution. And that is a much more defensible product position long-term. You instrument trace collection in your existing runtime, and the optimization loop works regardless of where the agent runs.
For integration with existing financial data pipelines, the point of attention is the trace data model. If you already have distributed tracing with OpenTelemetry and Datadog, you need to understand how AgentCore traces relate to your existing spans. The recommendation is to maintain trace context propagation (W3C TraceContext) between AgentCore and your downstream systems, so that an agent trace can be correlated with the database transaction, the market data API call, or the MSK event it generated.
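As a small illustration of W3C TraceContext propagation, the sketch below parses an incoming traceparent header and derives the header a tool call would forward downstream, keeping the same trace ID. It handles only version 00 of the header and ignores tracestate; in practice an OpenTelemetry SDK propagator would do this for you, so treat this as an explanation of the format rather than something to ship.

```python
import re
import secrets

# version 00: "00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>"
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Parse a version-00 traceparent header into its three fields."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        raise ValueError(f"not a version-00 traceparent: {header!r}")
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("all-zero trace-id or parent-id is invalid")
    return {"trace_id": trace_id, "parent_id": parent_id, "flags": flags}


def child_traceparent(header):
    """Build the header for a downstream call: same trace, new span ID."""
    ctx = parse_traceparent(header)
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    while span_id == "0" * 16:
        span_id = secrets.token_hex(8)
    return f"00-{ctx['trace_id']}-{span_id}-{ctx['flags']}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
```

Because the trace ID is preserved, the agent step, the database transaction, and the downstream event can all be joined on one identifier in whichever backend stores them.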
From an IAM perspective, permissions for AgentCore to access production traces and run analyses must follow least privilege with specific conditions. The trace-read and analysis actions (written here as bedrock:GetAgentTrace and bedrock:AnalyzeAgentBehavior; confirm the exact action names against the current IAM documentation for the service) should be scoped by aws:ResourceTag/Environment so that the staging optimization pipeline cannot access production traces. KMS customer-managed keys for traces containing customer data are mandatory in any regulated financial environment.
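A sketch of what an environment-scoped policy document could look like, built as a plain Python dict. The action names are copied from the text above and should be treated as placeholders until checked against the service's IAM documentation; the condition also only takes effect if the service supports tag-based authorization for those actions, which I have not verified.

```python
import json


def trace_access_policy(environment, actions):
    """Build an IAM policy that allows `actions` only on resources tagged
    Environment=<environment>."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "TraceAccessForOneEnvironment",
                "Effect": "Allow",
                "Action": list(actions),
                "Resource": "*",
                "Condition": {
                    "StringEquals": {"aws:ResourceTag/Environment": environment}
                },
            }
        ],
    }


# Placeholder action names taken from the article, not verified.
staging_policy = trace_access_policy(
    "staging", ["bedrock:GetAgentTrace", "bedrock:AnalyzeAgentBehavior"]
)
policy_json = json.dumps(staging_policy, indent=2)
```

Attaching this to the staging pipeline's role gives it no path to resources tagged Environment=production, which is the separation the paragraph asks for.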
How to Adopt in Financial Environments: Recommended Sequence
1. Audit your current trace model
Before enabling any AgentCore feature, map what is in your agent traces today: do they contain PII? Transaction data? Prompt content with sensitive information? Define a redaction policy and implement it in the trace emitter before connecting to AgentCore. Use a KMS CMK with a key policy that restricts access to the AgentCore role via the kms:ViaService condition.
2. Start with batch evaluation in staging
Batch evaluation is GA and is the safest building block. Build a representative evaluation dataset with real use cases, including compliance edge cases (e.g., attempts to obtain recommendations outside authorized scope). Define your evaluators with explicit, measurable criteria. This establishes the baseline before any optimization.
3. Enable Failure Insights in preview with limited scope
As it is in preview, enable only for a subset of non-critical agents first. Configure cost monitoring with AWS Budgets for the AgentCore observability namespace. Validate that identified failure patterns match what your team already knows — this calibrates confidence in the system before using it for discovery.
4. Implement A/B testing with side effect isolation
Before enabling traffic splitting, implement idempotency keys on all tool calls with external side effects. Use DynamoDB conditional writes with attribute_not_exists(idempotency_key) to ensure an action is not executed twice if the same request hits different versions during the split. Document the test period and criteria as an ADR for audit purposes.
5. Formalize the recommendation approval process
Create a documented process (can be a GitHub PR with a specific template) for human review of each AgentCore-generated recommendation before applying it. For agents affecting financial decisions, require approval from a senior architect and the compliance team. Record the original AgentCore rationale and the human decision in the same ADR.
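The conditional write from step 4 can be sketched as follows. The key is derived from the session, the step, the tool name and its arguments, and deliberately not from the agent version, so both variants of a split produce the same key for the same action. The table name and item shape are hypothetical, and the sketch assumes idempotency_key is the table's partition key, since the condition is evaluated against the item with the same primary key.

```python
import hashlib
import json


def idempotency_key(session_id, step_index, tool_name, arguments):
    """Derive a stable key for one tool call, independent of agent version."""
    canonical = json.dumps(arguments, sort_keys=True, separators=(",", ":"))
    raw = f"{session_id}|{step_index}|{tool_name}|{canonical}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def build_idempotent_put(table_name, key, tool_name):
    """Build PutItem parameters that succeed only the first time a key is seen."""
    return {
        "TableName": table_name,
        "Item": {
            "idempotency_key": {"S": key},
            "tool_name": {"S": tool_name},
        },
        "ConditionExpression": "attribute_not_exists(idempotency_key)",
    }


key = idempotency_key("session-42", 3, "send_payment", {"amount": 100, "to": "acct-9"})
request = build_idempotent_put("tool_call_ledger", key, "send_payment")
```

The resulting dict is meant to be passed to a boto3 DynamoDB client as put_item(**request). If the call fails with ConditionalCheckFailedException, the action was already recorded and the tool call should be skipped rather than repeated.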
Well-Architected Pillars Analysis
Security
Agent traces may contain PII, transaction data and sensitive prompt content. Require KMS CMK with restricted key policy, implement redaction in the trace emitter, and scope IAM permissions by environment tag. A/B testing with access to production data requires access controls equivalent to the main system.
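Redaction in the trace emitter can start as simply as the sketch below, which masks a few obvious identifiers in string attributes before a span leaves the process. The patterns (email, Brazilian CPF in its formatted form, 16-digit card numbers) are illustrative assumptions and far from a complete PII policy; a real deployment would pair this with an allow-list of attributes.

```python
import re

# Illustrative patterns only; a production redaction policy needs much more.
_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "cpf": re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b"),
    "card": re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b"),
}


def redact(text):
    """Replace every match of each pattern with a labeled placeholder."""
    for label, pattern in _PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label.upper()}]", text)
    return text


def redact_span_attributes(attributes):
    """Redact string values in a span's attribute dict; leave other types as-is."""
    return {
        name: redact(value) if isinstance(value, str) else value
        for name, value in attributes.items()
    }


clean = redact_span_attributes(
    {"user.prompt": "Contact ana@example.com, CPF 123.456.789-09", "amount": 100}
)
```

Running redaction at the emitter, rather than after ingestion, means the sensitive values never reach the trace store at all.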
Reliability
The optimization loop is an auxiliary system — its failure must not impact primary agent execution. Implement circuit breakers between the agent runtime and the AgentCore trace collector. A/B testing with side effects requires idempotency keys on all external tool calls to prevent duplicate actions during traffic splits.
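A minimal sketch of a circuit breaker around trace export, assuming a hypothetical send callable that ships one trace to the collector. After a configurable number of consecutive failures the breaker stops calling the collector for a cooldown period, and it never lets an export error propagate into the agent's request path. Thresholds and the trace shape are assumptions.

```python
import time


class TraceExportBreaker:
    """Drop traces instead of blocking the agent when the collector is failing."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    def export(self, send, trace):
        """Try to send one trace; return True on success, False if dropped."""
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                return False  # open: skip the collector entirely
            self._opened_at = None  # cooldown elapsed: allow one trial call
        try:
            send(trace)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the breaker
            return False
        self._failures = 0  # any success closes the breaker fully
        return True


breaker = TraceExportBreaker(failure_threshold=2, reset_after_s=10.0)
delivered = []
sent = breaker.export(delivered.append, {"trace_id": "t-1"})
```

Dropping traces while the breaker is open is a deliberate trade-off: losing some observability data is preferable to adding latency or errors to a customer-facing agent call.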
What Is Still Missing: Gaps That Matter for Serious Production
Despite significant progress, there are gaps that need to be addressed before I recommend AgentCore as the observability backbone for agents in high-criticality financial environments.
No documented native OpenTelemetry integration: The modern observability ecosystem converges on OTel. If AgentCore emits traces in a proprietary format without native OTel exporters, you create an observability silo, with agent traces separated from infrastructure traces and no automatic correlation. For environments that already have Datadog or Grafana as the observability control plane, this is a real operations problem.
Cost model not yet fully documented: AWS has not published detailed pricing for insights analysis at scale. For an architecture team making adoption decisions, the absence of concrete cost numbers per analyzed session, per generated recommendation, or per hour of active A/B testing is a barrier to building a business case.
Governance of automatic recommendations: The system generates system prompt change recommendations derived from production data. For organizations with formal change management processes (ITIL, ISO 27001, SOX), it is not clear how these recommendations fit into the existing approval workflow. We need integration with ticketing systems (Jira, ServiceNow) and approval workflow support before this is viable in regulated enterprises.
Volume limits for insights analysis: Current documentation does not specify session limits per analysis, insights processing latency, or behavior under high concurrency. For financial platforms with predictable traffic spikes (market open, contract expiration), these limits are critical for capacity planning.
If I were adopting this today in a financial environment, I would start exclusively with batch evaluation in GA — it is the most mature component and the one that delivers immediate value without preview risks. The lesson I learned operating high-stakes systems is that the biggest risk is not the new technology itself, but premature trust in it: teams that adopt agent A/B testing without first solving tool call idempotency will create inconsistent production states that are extremely difficult to diagnose. I would also insist on maintaining a parallel observability plan with OTel and Datadog until AgentCore's native integration with the existing trace ecosystem is documented and tested — two observability planes are better than one opaque silo. The concept of closing the MLOps loop for agents is correct and necessary; the execution is on the right track, but still requires operational maturity before anchoring a regulated system.
AgentCore vs. DIY Agent Observability Approach
| Dimension | AgentCore (Managed) | DIY (OTel + Datadog + Scripts) |
|---|---|---|
| Silent failure detection | Native, pattern analysis at session scale | Requires custom behavioral analysis implementation |
| Improvement recommendations | Automatically generated with traceable rationale | Manual, based on human trace analysis |
| Version A/B testing | Native with traffic split and statistical evidence | Requires custom feature flags and manual statistical analysis |
| OTel ecosystem integration | Limited / not natively documented | Full: OTel is the standard for the DIY approach |
| Initial implementation cost | Low: managed by AWS | High: requires dedicated platform engineering |
| Control and auditability | Medium: depends on AWS APIs for data access | High: data and logic fully under team control |
Verdict: Promising, But Operational Maturity Still Being Built
Amazon Bedrock AgentCore represents the most coherent approach AWS has ever launched for the real problem of operating AI agents in production: not just running them, but understanding what they are doing, identifying where they fail silently, and improving them with statistical rigor. The observe → diagnose → recommend → validate → promote loop is architecturally correct and addresses the gap that every team operating agents in production feels.
For high-criticality financial environments, my recommendation is phased adoption: start with batch evaluation in GA today to establish quality baselines; pilot Failure Insights in preview on non-critical agents to calibrate confidence; adopt A/B testing only after solving tool call idempotency and formalizing the recommendation approval process. Do not try to do everything at once.
What prevents me from giving an unrestricted recommendation is the combination of: insights still in preview (no SLA), absence of documented OTel integration, unpublished cost model at scale, and governance gaps for regulated environments. These are not philosophical objections. They are real operational requirements that need answers before fleet-wide adoption in financial systems.
Potential: high. Maturity for regulated financial production: medium-high for GA components, medium for preview components. Watch this space closely, because it will evolve rapidly over the next few quarters.