Who is Fernando F. Azevedo?

Fernando F. Azevedo is a Senior Solutions Architect at Banco Itaú with 16+ years of experience across AWS, event-driven architecture, DevSecOps, Data Mesh, AI and financial systems.

What technical topics does Fernando work with?

Fernando works with AWS, Kubernetes, Kafka, Data Mesh, Amazon Bedrock, RAG, DevSecOps, observability, financial systems and architecture communication using C4, ADRs and trade-off analysis.

Is Fernando available for professional conversations?

Fernando is currently building at Banco Itaú and is open to thoughtful conversations about architecture, cloud, AI, engineering leadership, community, podcasts and technical collaboration.

AI & Agents · Decision Record

ADR: Scaling Agents to Production with AgentCore Runtime Quotas

Jul 2, 2026 · 9 min · expert · AI-assisted

fernando.moretes.com

In July 2026, AWS raised AgentCore Runtime default limits to 5,000 active concurrent sessions in us-east-1/us-west-2 and 200 interactions per second across all regions. This ADR documents the context that forced this design decision, the architectural options I evaluated for financial-grade agentic systems at scale, and the operational consequences you must plan for before putting agents into production.

Doubled default quotas aren't just a nice product headline — they're a signal that AWS is betting agentic workloads are going to production at scale. This ADR is about what changes architecturally when the ceiling rises, and what can still take you down.

Context and Forces: Why This Matters Now

Since Amazon Bedrock AgentCore launched, I've been closely tracking how financial engineering teams try to fit AI agents into workflows that demand auditability, controlled latency, and fault isolation. The recurring problem wasn't model capability — it was runtime infrastructure. Conservative session limits forced workaround architectures: SQS queues to serialize agent calls, Step Functions to manage session lifecycles outside the platform, and manual retry logic that duplicated what AgentCore itself should provide.

The July 1, 2026 announcement changes that calculus. With 5,000 active concurrent sessions in us-east-1 and us-west-2, and 2,500 in other supported regions, AgentCore becomes viable as a primary runtime for use cases that previously required hybrid solutions. The 200 agent interactions per second and 25 new sessions per second as defaults — without requiring a quota increase request — eliminate an entire category of operational friction when onboarding new workloads.

But raising the default ceiling doesn't solve the design problems that emerge when you actually use that capacity. In financial environments, every agent session carries sensitive context: customer data, transaction state, decision history. Horizontal session scaling amplifies the attack surface, observability complexity, and cost risk. The technical signal here isn't 'use more agents' — it's 'now you need a serious session control architecture'.

AgentCore Runtime New Default Limits (July 2026)

5,000 · Active concurrent sessions (us-east-1, us-west-2) · Default without quota increase request
2,500 · Active concurrent sessions (other supported regions) · Available in all regions where AgentCore is available
200/s · Agent interactions per second (all regions) · New uniform regional default
25/s · New sessions created per second (all regions) · Session creation rate: critical bottleneck under burst traffic

The Architectural Forces This Announcement Exposes

When you read '5,000 concurrent sessions', the first engineering reaction is: 'great, I can scale'. The second reaction — the one that matters — should be: 'what happens when I'm at 4,800 sessions and a burst of 300 simultaneous users tries to create new sessions at 25/s?'

The 25 new sessions per second limit is the real bottleneck in this architecture. In a financial system with market peaks — exchange open, settlement windows, credit events — the session creation rate can exceed this limit even with abundant headroom in the active session pool. This demands a session pre-warming pattern: creating sessions in advance and holding them in a managed pool, reusing them for incoming requests rather than creating a new session per user interaction.
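The burst arithmetic can be made concrete with a small sketch. The function names are illustrative, and the model is deliberately simple: it assumes requests beyond the pre-warmed pool wait for sessions created at exactly the quota rate.

```python
import math

def admissible_sessions(active: int, burst: int, quota: int = 5000) -> int:
    """How many of the burst's sessions fit under the active-session quota."""
    return max(0, min(burst, quota - active))

def seconds_to_create(new_sessions: int, pooled: int = 0, rate_per_s: int = 25) -> int:
    """Whole seconds needed to create the sessions not covered by the pool."""
    uncovered = max(0, new_sessions - pooled)
    return math.ceil(uncovered / rate_per_s)

print(admissible_sessions(active=4800, burst=300))  # 200 fit, 100 are rejected
print(seconds_to_create(300))                        # 12 seconds with no pool
print(seconds_to_create(300, pooled=200))            # 4 seconds with 200 pre-warmed
```

With no pool, the last user in a 300-user burst waits about 12 seconds for a session, which is why the creation rate rather than the active-session ceiling dominates the design.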

The second pressure vector is context isolation. With 5,000 active sessions, each potentially carrying customer context in runtime memory, the question of where session state lives becomes critical. AgentCore manages session state internally, but for regulatory audit purposes in financial environments, you need an external, immutable audit trail. This means integrating AgentCore with DynamoDB (for projected session state with TTL and partition key on customerId#sessionId) and S3 with Object Lock for agent decision logs.
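As a rough illustration of the projected session state item, the sketch below builds a DynamoDB item in the low-level attribute-value format. The attribute names `pk` and `expiresAt` and the 30-minute default are assumptions for the example; the only fixed requirement is that a DynamoDB TTL attribute holds a Number in epoch seconds.

```python
import time
from typing import Optional

def session_state_item(customer_id: str, session_id: str,
                       ttl_seconds: int = 1800,
                       now: Optional[float] = None) -> dict:
    """Build an item keyed on customerId#sessionId with an expiry timestamp."""
    now = time.time() if now is None else now
    return {
        # Composite partition key: customerId#sessionId
        "pk": {"S": f"{customer_id}#{session_id}"},
        # TTL attribute: epoch seconds, stored as a DynamoDB Number
        "expiresAt": {"N": str(int(now) + ttl_seconds)},
    }

print(session_state_item("cust-42", "sess-7", now=1_000_000.0))
```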

The third vector is cost. At 200 interactions/s sustained, with each interaction invoking a Bedrock model (Claude, Titan, or otherwise), token costs can scale non-linearly. Without per-session circuit breakers and without per-interaction token limits configured explicitly, a poorly instructed agent can consume a week's budget in hours.
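A per-session circuit breaker can be as small as a counter that refuses further work once a budget is spent. This is a minimal sketch with hypothetical class names and limits; it assumes the caller knows the token count of each interaction before charging it.

```python
class TokenBudgetExceeded(Exception):
    """Raised when an interaction would break a configured token limit."""

class SessionTokenBudget:
    def __init__(self, max_per_session: int, max_per_interaction: int):
        self.max_per_session = max_per_session
        self.max_per_interaction = max_per_interaction
        self.used = 0

    def charge(self, tokens: int) -> int:
        """Record an interaction's tokens and return the remaining budget."""
        if tokens > self.max_per_interaction:
            raise TokenBudgetExceeded("interaction exceeds per-interaction limit")
        if self.used + tokens > self.max_per_session:
            raise TokenBudgetExceeded("session budget exhausted")
        self.used += tokens
        return self.max_per_session - self.used

budget = SessionTokenBudget(max_per_session=1000, max_per_interaction=400)
print(budget.charge(300))  # 700 tokens remain
```

Rejected charges leave `used` unchanged, so a tripped breaker does not distort the session's accounting.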

Architectural Options for Agent Sessions at Scale

Option A: Session per Request (Stateless Pattern)

Pros

Simple to implement; no pool management
Complete isolation between requests

Cons

Exhausts the 25 new sessions/s limit quickly under burst
Session creation overhead adds significant P99 latency
No context reuse across interactions from the same user

Acceptable only for low-volume, low-frequency workloads

Option B: Session Pool with Pre-warming via Scheduled Lambda

Pros

Absorbs bursts without hitting the session creation limit
Controlled first-interaction latency (session already active)
Allows user context association to pre-warmed sessions

Cons

Pool management complexity; risk of orphaned sessions
Cost of unused active sessions during low-traffic periods
Requires session affinity logic and TTL-based cleanup in DynamoDB

Recommended for financial systems with predictable traffic patterns

Option C: Step Functions Orchestration with AgentCore as Worker

Pros

Explicit session lifecycle control with auditable state
Native Step Functions retry/idempotency; CloudWatch Alarms integration
Clear separation between business orchestration and agent execution

Cons

Additional latency per Step Functions hop (50-200ms per state transition)
State transition cost in high-frequency workflows
State machine design complexity for long-running agent flows

Ideal for approval workflows, compliance, and long-running business processes

Option D: Active-Active Multi-Region with Latency-Based Routing

Pros

Uses the 5,000-session pool of us-east-1 and us-west-2 in parallel
Regional resilience; no runtime single point of failure

Cons

Session state is not natively replicated across regions by AgentCore
Context consistency complexity and cross-region session affinity
Cross-region data cost and state synchronization latency

Valid only for RTO < 1min requirements; requires careful consistency design

The Decision: Session Pool with Admission Control and External Audit

For financial systems operating AgentCore in production, the decision I advocate is Option B with elements of Option C: a pre-warmed session pool managed by Lambda with admission control, combined with Step Functions orchestration for flows requiring approval or regulatory audit.

The reasoning is direct: the 25 new sessions/s limit is the only bottleneck that cannot be resolved with more runtime capacity — it is an API rate limit, not a compute capacity limit. Pre-warming transforms this limit from a burst bottleneck into a capacity planning parameter. With a Lambda scheduled every 5 minutes checking the pool and creating sessions up to a configurable target (say, 200 ready sessions), you absorb bursts of up to 200 simultaneous users without touching the creation limit.
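The warmer's core decision is how many sessions to create on each run and how to pace them under the creation limit. The sketch below covers only that arithmetic; the function names and the per-run cap are assumptions, and the actual session creation call is left out.

```python
def sessions_to_create(available: int, target: int, max_per_run: int) -> int:
    """Deficit against the pool target, capped per warmer run."""
    deficit = max(0, target - available)
    return min(deficit, max_per_run)

def creation_batches(count: int, rate_per_s: int = 25) -> list[int]:
    """Split the work into one batch per second, each within the rate limit."""
    full, rest = divmod(count, rate_per_s)
    return [rate_per_s] * full + ([rest] if rest else [])

todo = sessions_to_create(available=140, target=200, max_per_run=100)
print(todo)                    # 60 sessions missing from the pool
print(creation_batches(todo))  # [25, 25, 10]
```

In practice the warmer shares the creation quota with any on-demand session creation, so passing a `rate_per_s` below 25 leaves headroom for live traffic.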

For admission control, I use a DynamoDB table with partition key poolId and sort key sessionId, with a status attribute (AVAILABLE | IN_USE | DRAINING) and 30-minute TTL. A dispatcher Lambda performs a conditional UpdateItem with ConditionExpression: attribute_exists(sessionId) AND #status = :available — this ensures two concurrent dispatchers never assign the same session. Idempotency is guaranteed by the correlationId from the original request, stored as a session attribute.
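The conditional claim described above could be expressed as the following UpdateItem parameters, in the low-level format accepted by the boto3 DynamoDB client's `update_item`. The table name and the `correlationId` attribute name are assumptions for illustration; the key schema and condition come from the text.

```python
def claim_session_request(table: str, pool_id: str, session_id: str,
                          correlation_id: str) -> dict:
    """Build UpdateItem parameters that claim a session only if it is AVAILABLE."""
    return {
        "TableName": table,
        "Key": {"poolId": {"S": pool_id}, "sessionId": {"S": session_id}},
        # Flip the status and record which request claimed the session.
        "UpdateExpression": "SET #status = :in_use, correlationId = :cid",
        # The write fails unless the item exists and is still AVAILABLE.
        "ConditionExpression": "attribute_exists(sessionId) AND #status = :available",
        # "status" is a DynamoDB reserved word, hence the #status alias.
        "ExpressionAttributeNames": {"#status": "status"},
        "ExpressionAttributeValues": {
            ":in_use": {"S": "IN_USE"},
            ":available": {"S": "AVAILABLE"},
            ":cid": {"S": correlation_id},
        },
    }

params = claim_session_request("agent-session-pool", "pool-a", "sess-001", "req-123")
print(params["ConditionExpression"])
```

A dispatcher would pass these as `client.update_item(**params)` and treat a `ConditionalCheckFailedException` as "another dispatcher won this session, try the next candidate".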

For audit, every agent interaction — input, output, tools invoked, tokens consumed — is published to a Kinesis Data Firehose stream with S3 destination with Object Lock (COMPLIANCE mode, 7 years for Brazilian financial regulation). The KMS key policy restricts kms:Decrypt to audit roles with condition aws:PrincipalTag/Role: AuditReader, preventing the main application from accessing its own audit logs.
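One way to express the decrypt restriction is a key policy statement conditioned on the principal tag. This is a sketch, not a complete key policy: the `Sid` and account ID are placeholders, and an Allow statement only restricts access if no other statement or grant gives `kms:Decrypt` to the application role.

```python
import json

def audit_decrypt_statement(account_id: str) -> dict:
    """Key policy statement allowing kms:Decrypt only to principals tagged Role=AuditReader."""
    return {
        "Sid": "AllowDecryptForAuditReaders",  # placeholder identifier
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
        "Action": "kms:Decrypt",
        "Resource": "*",
        "Condition": {
            "StringEquals": {"aws:PrincipalTag/Role": "AuditReader"}
        },
    }

print(json.dumps(audit_decrypt_statement("111122223333"), indent=2))
```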

Session Pool Architecture with Admission Control for AgentCore

AgentCore session lifecycle flow in a financial environment: pre-warming, dispatch with admission control, agent execution, immutable audit, and session draining.

🌐 Ingress

API Gateway · REST + WAF
Lambda Dispatcher · Conditional UpdateItem

🗄️ Session Pool

DynamoDB · poolId | sessionId · status + TTL 30min
Lambda Warmer · EventBridge 5min · Creates sessions up to target

🤖 AgentCore Runtime

AgentCore Runtime · 5k sessions / us-east-1 · 200 interactions/s
Bedrock Model · Claude / Titan · Token budget enforced

🔐 Audit & Security

Kinesis Firehose · Interaction → S3
S3 Object Lock · COMPLIANCE 7 years · KMS AuditReader only
CloudWatch · SLO: sessions/quota · Alarm: >80% pool used

🔄 Regulatory Flows

Step Functions · Approval / Compliance · Long-running workflow

Observability: What to Monitor When Sessions Scale

With 5,000 potentially active sessions, observability cannot be reactive — you need SLOs defined before going to production. The three signals I instrument in any financial-scale AgentCore deployment are:

1. Pool Utilization Ratio: sessions_in_use / pool_target. Alarm at 80% — not 100%. At 80%, the warmer Lambda should be triggered immediately to create additional sessions. If you wait for 100%, you're already rejecting requests. In CloudWatch, this is a custom metric published by the dispatcher with PoolId and Region dimensions, with an alarm action invoking the warmer via SNS.

2. Session Creation Lag: time between a new session request and the session being available for use. Under normal conditions with pre-warming, this should be near zero (session already available in pool). If it starts rising, either the warmer isn't keeping pace or the 25 sessions/s limit is being hit. A CloudWatch alarm on the p99 statistic of session_allocation_latency, with a 500ms threshold, should signal a capacity problem.

3. Token Budget Exhaustion Rate: percentage of interactions that hit the configured per-session token limit. In financial systems, an agent exhausting its token budget is likely in a loop or received an adversarial prompt. This signal should trigger both an operational alarm and a security event — correlated to sessionId and customerId for investigation.
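The first two signals reduce to simple calculations that can be checked locally before wiring them into CloudWatch. The sketch below uses hypothetical function names and a nearest-rank percentile, which may differ slightly from how CloudWatch computes p99 on its side.

```python
def pool_utilization(sessions_in_use: int, pool_target: int) -> float:
    """sessions_in_use / pool_target, the ratio published by the dispatcher."""
    if pool_target <= 0:
        raise ValueError("pool_target must be positive")
    return sessions_in_use / pool_target

def should_trigger_warmer(sessions_in_use: int, pool_target: int,
                          threshold: float = 0.80) -> bool:
    """Alarm at the threshold, not at full exhaustion."""
    return pool_utilization(sessions_in_use, pool_target) >= threshold

def p99(samples_ms: list[float]) -> float:
    """Nearest-rank 99th percentile of allocation latencies."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]

print(should_trigger_warmer(160, 200))  # True: exactly at 80%
print(p99([5.0] * 98 + [620.0, 900.0]) > 500)  # True: tail latency breaches 500ms
```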

For distributed correlation, I inject the original request's correlationId as an AgentCore session attribute and as a field in all Firehose events. This allows tracing a user interaction from API Gateway to the S3 audit log, through all model invocations — essential for compliance investigations.
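A possible shape for the audit event, carrying the correlationId alongside the fields listed earlier (input, output, tools invoked, tokens consumed), is sketched below. The field names are assumptions; the `{"Data": bytes}` wrapper matches the record format of the Firehose PutRecord API, and the trailing newline keeps objects in S3 parseable line by line.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class AgentAuditEvent:
    correlation_id: str
    session_id: str
    customer_id: str
    input_text: str
    output_text: str
    tools_invoked: tuple
    tokens_consumed: int

def to_firehose_record(event: AgentAuditEvent) -> dict:
    """Serialize the event as one newline-delimited JSON record."""
    line = json.dumps(asdict(event), sort_keys=True) + "\n"
    return {"Data": line.encode("utf-8")}

event = AgentAuditEvent("req-123", "sess-001", "cust-42",
                        "What is my balance?", "Your balance is ...",
                        ("get_balance",), 187)
print(to_firehose_record(event)["Data"].decode("utf-8"))
```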

Consequences: What Can Go Wrong with Sessions at Scale

Orphaned sessions silently accumulate cost. If the dispatcher fails after allocating a session but before marking it IN_USE, the session stays AVAILABLE but with contaminated context; if it fails after marking it, the session stays IN_USE with nobody using it. Implement a reconciliation Lambda that scans for IN_USE sessions whose last_activity is older than 15 minutes, drains them, and creates replacements. Without this, you'll accumulate sessions with stale context, causing incorrect responses and unnecessary charges.

The 25 sessions/s limit applies per AWS account, so every workload in that account draws from it. If you operate multiple environments (dev, staging, prod) in the same account in us-east-1, they all share this limit. Separate environments into distinct AWS accounts with AWS Organizations; this is not optional in regulated financial environments.

Session creation bursts can mask an attack. A sudden spike in session creation rate can be legitimate traffic or a resource exhaustion attempt. Configure a WAF rate rule on API Gateway limiting session creation requests to 10 per IP per second, and an AWS Shield Advanced rule for anomalous patterns. The cost of an AgentCore session created by an attacker is your cost, not theirs.
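The selection step of the reconciliation scan might look like the sketch below. It assumes each pool record is a dict with `sessionId`, `status`, and a `last_activity` timestamp in epoch seconds; draining and replacing the sessions is left to the caller.

```python
def find_orphaned(sessions: list[dict], now: float,
                  max_idle_s: int = 15 * 60) -> list[str]:
    """Return IDs of IN_USE sessions idle for longer than max_idle_s."""
    return [
        s["sessionId"]
        for s in sessions
        if s["status"] == "IN_USE" and now - s["last_activity"] > max_idle_s
    ]

pool = [
    {"sessionId": "a", "status": "IN_USE", "last_activity": 0.0},
    {"sessionId": "b", "status": "IN_USE", "last_activity": 900.0},
    {"sessionId": "c", "status": "AVAILABLE", "last_activity": 0.0},
]
print(find_orphaned(pool, now=1000.0))  # ['a']
```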

Regulatory and Governance Implications for Financial Markets

The AgentCore quota increase arrives at a moment when global financial regulators — including Brazil's Central Bank with its AI guidelines for financial institutions, and DORA in Europe — are formalizing requirements for AI systems in production. Scaling agents to 5,000 concurrent sessions without a corresponding governance strategy is a regulatory risk, not just a technical one.

The three governance requirements every AgentCore deployment in a financial environment must address are: decision explainability, granular access control, and adversarial testability.

For explainability, recording each agent interaction in Firehose/S3 is not sufficient — you need to record why the agent made each decision: which tools it considered, which it discarded, what the intermediate reasoning was. This requires configuring AgentCore with enableTrace: true and capturing trace events, not just the final output. In a BACEN audit, 'the model decided' is not an acceptable answer.

For access control, each AgentCore session must be associated with a verified end-user identity — not just the application's IAM role. This means propagating the sub from the user's JWT token as a session attribute and using that attribute in tool authorization policies within the agent. An agent that can invoke any tool for any user is a violation of least-privilege at the runtime level.
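The runtime check this implies can be sketched as a lookup keyed on the propagated `sub`. The grants table and tool names here are hypothetical, and the sketch assumes the JWT has already been verified upstream before `sub` is placed in the session attributes.

```python
def authorize_tool(session_attributes: dict, tool_name: str,
                   grants: dict[str, set[str]]) -> bool:
    """Allow a tool call only if the session's end user is granted that tool."""
    sub = session_attributes.get("sub")
    if sub is None:
        # No verified end-user identity on the session: deny by default.
        return False
    return tool_name in grants.get(sub, set())

grants = {"user-123": {"get_balance", "list_transactions"}}
print(authorize_tool({"sub": "user-123"}, "get_balance", grants))        # True
print(authorize_tool({"sub": "user-123"}, "initiate_transfer", grants))  # False
```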

For adversarial testability, with 200 interactions/s available, you have capacity to run automated red team tests in staging without impacting production. Implement a CI/CD pipeline that executes a known set of adversarial prompts against each new agent instruction version before deployment — and block the deploy if the inadequate response rate exceeds a defined threshold.
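The deploy gate itself is a small calculation over the red team results. In this sketch each verdict is `True` when a response to an adversarial prompt was judged inadequate; how that judgment is made, and the 2% default threshold, are assumptions.

```python
def inadequate_rate(verdicts: list[bool]) -> float:
    """Fraction of adversarial prompts that produced an inadequate response."""
    if not verdicts:
        raise ValueError("no adversarial test results")
    return sum(verdicts) / len(verdicts)

def allow_deploy(verdicts: list[bool], threshold: float = 0.02) -> bool:
    """Block the deploy when the inadequate rate exceeds the threshold."""
    return inadequate_rate(verdicts) <= threshold

results = [False] * 97 + [True] * 3  # 3 inadequate responses out of 100
print(allow_deploy(results))         # False: 3% exceeds the 2% threshold
```

Raising on an empty result set is deliberate: a pipeline that ran zero adversarial prompts should fail rather than pass by default.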

AgentCore at Scale: Well-Architected Lenses

Security

Propagate end-user identity as AgentCore session attribute. Use KMS CMK with restrictive key policy for audit logs. Configure WAF rate limiting on session creation. Separate environments into distinct AWS accounts.

Reliability

Implement session pool pre-warming to absorb bursts without hitting the 25 sessions/s limit. Periodic orphaned session reconciliation. Per-session circuit breaker to prevent agent loops. Multi-AZ for pool control DynamoDB.

Performance efficiency

Pool utilization ratio as primary SLO (target < 80%). Alarm on p99 session allocation latency > 500ms. Per-session token budget to prevent long-running interactions blocking capacity.

Common Anti-Patterns When Scaling AgentCore

Creating a new AgentCore session per HTTP request without a pool — exhausts the 25 sessions/s limit under any moderate burst and adds unnecessary session creation latency to P99.
Using the same AWS account for dev, staging, and prod with AgentCore — all environments share runtime quotas, and a staging load test can degrade production.
Not configuring a per-session token budget — an agent in a loop or with an adversarial prompt can consume model budget in minutes, with no alarm until the billing cycle ends.
Logging only the agent's final output without trace events — in a regulatory audit, you cannot reconstruct the agent's reasoning and 'the model decided' is not an acceptable answer.
Not implementing session draining logic during new instruction version deploys — active sessions with the previous version continue responding with outdated behavior during the rollout.

Architect's Note

Senior Solutions Architect

In practice, what concerns me about this announcement isn't the quota increase itself — it's that it will encourage teams to put agents into production without having solved the session governance problem first. I've seen this happen with Lambda concurrency and with Kinesis shards: capacity grows before operational maturity, and the result is cost and data incidents that cost far more than the time it would have taken to design the admission control correctly. My concrete recommendation: before using more than 500 concurrent sessions in production, implement the pool with DynamoDB, the utilization alarm at 80%, and trace logging in Firehose — in that order, without skipping steps. The hard-won lesson is that in financial systems, the ability to scale is a risk until you have the observability to understand what is happening at that scale.

Verdict: Use the Capacity, But Govern Before You Scale

Adopt with Governance

The AgentCore default quota increase is a genuinely significant change for those building agentic systems in production — it eliminates a category of operational friction and makes AgentCore viable as a primary runtime for financial-scale workloads. But the correct architectural decision is not 'scale to 5,000 sessions' — it's 'implement admission control, immutable audit, and pool observability before using more than 10% of that capacity'. The 25 new sessions/s limit is the real bottleneck; session pre-warming with DynamoDB resolves it. Separating environments into distinct AWS accounts is non-negotiable in regulated contexts. And trace logging with enableTrace: true is what separates an auditable system from a system that merely works. For financial teams: this is the moment to build the session governance foundation — not to simply increase the pool target.

References

Amazon Bedrock AgentCore increases default runtime quota limits (AWS What's New, Jul 1 2026)
AgentCore Runtime Release Notes: Increased Default Service Quotas (June 2026)
Quotas for Amazon Bedrock AgentCore
AgentCore Developer Guide
AgentCore Supported Regions
Amazon Bedrock AgentCore Product Page
Amazon Bedrock AgentCore now available in four additional AWS Regions (AWS What's New, Jun 2026)
AWS Well-Architected Framework: Machine Learning Lens

Analyzed source: Amazon Bedrock AgentCore increases default runtime quota limits
