ADR: Scaling Agents to Production with AgentCore Runtime Quotas
In July 2026, AWS raised AgentCore Runtime default limits to 5,000 active concurrent sessions in us-east-1/us-west-2 and 200 interactions per second across all regions. This ADR documents the context that forced this design decision, the architectural options I evaluated for financial-grade agentic systems at scale, and the operational consequences you must plan for before putting agents into production.
Doubled default quotas aren't just a nice product headline — they're a signal that AWS is betting agentic workloads are going to production at scale. This ADR is about what changes architecturally when the ceiling rises, and what can still take you down.
Context and Forces: Why This Matters Now
Since Amazon Bedrock AgentCore launched, I've been closely tracking how financial engineering teams try to fit AI agents into workflows that demand auditability, controlled latency, and fault isolation. The recurring problem wasn't model capability — it was runtime infrastructure. Conservative session limits forced workaround architectures: SQS queues to serialize agent calls, Step Functions to manage session lifecycles outside the platform, and manual retry logic that duplicated what AgentCore itself should provide.
The July 1, 2026 announcement changes that calculus. With 5,000 active concurrent sessions in us-east-1 and us-west-2, and 2,500 in other supported regions, AgentCore becomes viable as a primary runtime for use cases that previously required hybrid solutions. The 200 agent interactions per second and 25 new sessions per second as defaults — without requiring a quota increase request — eliminate an entire category of operational friction when onboarding new workloads.
But raising the default ceiling doesn't solve the design problems that emerge when you actually use that capacity. In financial environments, every agent session carries sensitive context: customer data, transaction state, decision history. Horizontal session scaling amplifies the attack surface, observability complexity, and cost risk. The technical signal here isn't 'use more agents' — it's 'now you need a serious session control architecture'.
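AgentCore Runtime New Default Limits (July 2026)
- Active concurrent sessions: 5,000 in us-east-1 and us-west-2; 2,500 in other supported regions
- Agent interactions: 200 per second
- New sessions: 25 per second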
The Architectural Forces This Announcement Exposes
When you read '5,000 concurrent sessions', the first engineering reaction is: 'great, I can scale'. The second reaction — the one that matters — should be: 'what happens when I'm at 4,800 sessions and a burst of 300 simultaneous users tries to create new sessions at 25/s?'
The 25 new sessions per second limit is the real bottleneck in this architecture. In a financial system with market peaks — exchange open, settlement windows, credit events — the session creation rate can exceed this limit even with abundant headroom in the active session pool. This demands a session pre-warming pattern: creating sessions in advance and holding them in a managed pool, reusing them for incoming requests rather than creating a new session per user interaction.
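To make the burst scenario concrete, here is a minimal sketch of the arithmetic, using the limits quoted above. The function and field names are hypothetical, and the model is deliberately naive: it treats the quotas as a hard ceiling and a fixed rate, which is an assumption and not a description of how AgentCore actually throttles.

```python
import math
from dataclasses import dataclass


@dataclass
class BurstOutcome:
    admitted: int          # sessions that fit under the active-session cap
    rejected: int          # sessions that would exceed the cap
    seconds_to_admit: int  # minimum time to create the admitted sessions


def cold_burst(active: int, burst: int, cap: int = 5000, create_rate: int = 25) -> BurstOutcome:
    """Outcome of a burst when every request creates a brand-new session."""
    headroom = max(cap - active, 0)
    admitted = min(burst, headroom)
    seconds = math.ceil(admitted / create_rate)
    return BurstOutcome(admitted, burst - admitted, seconds)


outcome = cold_burst(active=4800, burst=300)
print(outcome)
```

With 4,800 active sessions, only 200 of the 300 requests fit under the 5,000 cap, and creating those 200 at 25/s takes at least 8 seconds. Even with ample headroom, a 300-session burst needs 12 seconds of creation time, which is why the creation rate, not the pool size, sets first-interaction latency.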
The second pressure vector is context isolation. With 5,000 active sessions, each potentially carrying customer context in runtime memory, the question of where session state lives becomes critical. AgentCore manages session state internally, but for regulatory audit purposes in financial environments, you need an external, immutable audit trail. This means integrating AgentCore with DynamoDB (for projected session state with TTL and partition key on customerId#sessionId) and S3 with Object Lock for agent decision logs.
The third vector is cost. At 200 interactions/s sustained, with each interaction invoking a Bedrock model (Claude, Titan, or otherwise), token costs can grow faster than the interaction count, because each agent turn typically resends the accumulated conversation and tool context. Without per-session circuit breakers and explicit per-interaction token limits, a poorly instructed agent can consume a week's budget in hours.
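A per-session circuit breaker can be as small as a counter with two ceilings. The sketch below is illustrative only: the class name, the exception, and the limits are assumptions, and it trips after an interaction reports its token usage, so it complements (rather than replaces) a maximum-output-tokens setting on the model call.

```python
from dataclasses import dataclass


class TokenBudgetExceeded(Exception):
    """Raised when an interaction or a session exceeds its token allowance."""


@dataclass
class SessionTokenBudget:
    max_per_interaction: int
    max_per_session: int
    used: int = 0

    def charge(self, tokens: int) -> int:
        """Record the tokens of one interaction and return the remaining budget.

        Raises TokenBudgetExceeded without recording anything if either limit
        would be broken, so the caller can drain the session.
        """
        if tokens > self.max_per_interaction:
            raise TokenBudgetExceeded(f"interaction used {tokens} tokens")
        if self.used + tokens > self.max_per_session:
            raise TokenBudgetExceeded("session budget exhausted")
        self.used += tokens
        return self.max_per_session - self.used
```

The dispatcher would call charge() after every model invocation and move the session to DRAINING when the exception fires.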
Architectural Options for Agent Sessions at Scale
Option A: Session per Request (Stateless Pattern)
Pros:
- Simple to implement; no pool management
- Complete isolation between requests
Cons:
- Exhausts the 25 new sessions/s limit quickly under burst
- Session creation overhead adds significant P99 latency
- No context reuse across interactions from the same user
Verdict: Acceptable only for low-volume, low-frequency workloads
Option B: Session Pool with Pre-warming via Scheduled Lambda
Pros:
- Absorbs bursts without hitting the session creation limit
- Controlled first-interaction latency (session already active)
- Allows user context association to pre-warmed sessions
Cons:
- Pool management complexity; risk of orphaned sessions
- Cost of unused active sessions during low-traffic periods
- Requires session affinity logic and TTL-based cleanup in DynamoDB
Verdict: Recommended for financial systems with predictable traffic patterns
Option C: Step Functions Orchestration with AgentCore as Worker
Pros:
- Explicit session lifecycle control with auditable state
- Native Step Functions retry/idempotency; CloudWatch Alarms integration
- Clear separation between business orchestration and agent execution
Cons:
- Additional latency per Step Functions hop (50-200ms per state transition)
- State transition cost in high-frequency workflows
- State machine design complexity for long-running agent flows
Verdict: Ideal for approval workflows, compliance, and long-running business processes
Option D: Active-Active Multi-Region with Latency-Based Routing
Pros:
- Uses the 5,000-session pool of us-east-1 and us-west-2 in parallel
- Regional resilience; no runtime single point of failure
Cons:
- Session state is not natively replicated across regions by AgentCore
- Context consistency complexity and cross-region session affinity
- Cross-region data cost and state synchronization latency
Verdict: Valid only for RTO < 1min requirements; requires careful consistency design
The Decision: Session Pool with Admission Control and External Audit
For financial systems operating AgentCore in production, the decision I advocate is Option B with elements of Option C: a pre-warmed session pool managed by Lambda with admission control, combined with Step Functions orchestration for flows requiring approval or regulatory audit.
The reasoning is direct: the 25 new sessions/s limit is the only bottleneck that cannot be resolved with more runtime capacity — it is an API rate limit, not a compute capacity limit. Pre-warming transforms this limit from a burst bottleneck into a capacity planning parameter. With a Lambda scheduled every 5 minutes checking the pool and creating sessions up to a configurable target (say, 200 ready sessions), you absorb bursts of up to 200 simultaneous users without touching the creation limit.
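The warmer's core decision is how many sessions to create on each run and how to pace them. A minimal sketch of that planning step follows; the function name and the idea of one batch per second are assumptions for illustration, and the actual session creation call is intentionally left out.

```python
def plan_warm_up(available: int, target: int, create_rate: int = 25) -> list[int]:
    """Split the pool deficit into per-second batches that respect the creation rate.

    Returns one entry per second of work; each entry is the number of
    sessions to create during that second.
    """
    deficit = max(target - available, 0)
    batches = [create_rate] * (deficit // create_rate)
    if deficit % create_rate:
        batches.append(deficit % create_rate)
    return batches


print(plan_warm_up(available=140, target=200))  # deficit of 60 sessions
```

Filling an empty pool to a target of 200 takes eight one-second batches at the full rate. In practice I would pass a create_rate below 25 so the warmer leaves headroom for any request that still has to create a session on demand.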
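For admission control, I use a DynamoDB table with partition key poolId and sort key sessionId, with a status attribute (AVAILABLE | IN_USE | DRAINING) and a 30-minute TTL. A dispatcher Lambda performs a conditional UpdateItem with ConditionExpression: attribute_exists(sessionId) AND #status = :available, which ensures two concurrent dispatchers never assign the same session: the loser's write fails the condition and it moves on to another candidate. Because DynamoDB TTL deletes expired items asynchronously rather than at the exact expiry time, the dispatcher should also treat items past their TTL as unavailable. Idempotency comes from the original request's correlationId, stored as a session attribute, so a retried request can be matched to the session it already claimed.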
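The conditional claim can be sketched as a request builder. The key names, the status values, and the ConditionExpression come from the design above; the table name, the lastActivity attribute, and the function itself are hypothetical. The sketch only builds the arguments and does not call AWS.

```python
import time
from typing import Optional


def build_claim_request(table: str, pool_id: str, session_id: str,
                        correlation_id: str, now: Optional[int] = None) -> dict:
    """Build keyword arguments for a DynamoDB UpdateItem that claims one AVAILABLE session."""
    now = int(time.time()) if now is None else now
    return {
        "TableName": table,
        "Key": {"poolId": {"S": pool_id}, "sessionId": {"S": session_id}},
        # "status" is a DynamoDB reserved word, hence the #status placeholder.
        "UpdateExpression": "SET #status = :in_use, correlationId = :cid, lastActivity = :now",
        "ConditionExpression": "attribute_exists(sessionId) AND #status = :available",
        "ExpressionAttributeNames": {"#status": "status"},
        "ExpressionAttributeValues": {
            ":available": {"S": "AVAILABLE"},
            ":in_use": {"S": "IN_USE"},
            ":cid": {"S": correlation_id},
            ":now": {"N": str(now)},
        },
        "ReturnValues": "ALL_NEW",
    }


request = build_claim_request("agent-session-pool", "pool-a", "s-1", "corr-1", now=1700000000)
```

With boto3, the dictionary would be passed as client.update_item(**request). A dispatcher that loses the race receives a ConditionalCheckFailedException and should try the next candidate session rather than retry the same one.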
For audit, every agent interaction — input, output, tools invoked, tokens consumed — is published to a Kinesis Data Firehose stream with S3 destination with Object Lock (COMPLIANCE mode, 7 years for Brazilian financial regulation). The KMS key policy restricts kms:Decrypt to audit roles with condition aws:PrincipalTag/Role: AuditReader, preventing the main application from accessing its own audit logs.
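As an illustration of the key policy restriction, the statement below is a sketch of one way to express it, written as a Python dictionary so it can be serialized into a policy document. The Sid and the account ID are placeholders, and a complete key policy would also need statements for key administration and for the delivery role that encrypts records on write, which are assumptions outside this fragment.

```python
import json


def audit_decrypt_statement(account_id: str) -> dict:
    """Key policy statement: only principals tagged Role=AuditReader may decrypt."""
    return {
        "Sid": "AllowDecryptForAuditReadersOnly",  # hypothetical statement id
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
        "Action": "kms:Decrypt",
        "Resource": "*",
        "Condition": {"StringEquals": {"aws:PrincipalTag/Role": "AuditReader"}},
    }


statement = audit_decrypt_statement("111122223333")
print(json.dumps(statement, indent=2))
```

Because the application role carries no AuditReader tag, it can write audit events but cannot read them back, provided no other statement or grant gives it kms:Decrypt on this key.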
Session Pool Architecture with Admission Control for AgentCore
AgentCore session lifecycle flow in a financial environment: pre-warming, dispatch with admission control, agent execution, immutable audit, and session draining.
- API Gateway · REST + WAF
- Lambda Dispatcher · Conditional UpdateItem
- DynamoDB · poolId | sessionId · status + TTL 30min
- Lambda Warmer · EventBridge 5min · Creates sessions up to target
- AgentCore Runtime · 5k sessions / us-east-1 · 200 interactions/s
- Bedrock Model · Claude / Titan · Token budget enforced
- Kinesis Firehose · Interaction → S3
- S3 Object Lock · COMPLIANCE 7 years · KMS AuditReader only
- CloudWatch · SLO: sessions/quota · Alarm: >80% pool used
- Step Functions · Approval / Compliance · Long-running workflow
Observability: What to Monitor When Sessions Scale
With 5,000 potentially active sessions, observability cannot be reactive — you need SLOs defined before going to production. The three signals I instrument in any financial-scale AgentCore deployment are:
1. Pool Utilization Ratio: sessions_in_use / pool_target. Alarm at 80% — not 100%. At 80%, the warmer Lambda should be triggered immediately to create additional sessions. If you wait for 100%, you're already rejecting requests. In CloudWatch, this is a custom metric published by the dispatcher with PoolId and Region dimensions, with an alarm action invoking the warmer via SNS.
2. Session Creation Lag: time between a new session request and the session being available for use. Under normal conditions with pre-warming, this should be near zero (session already available in pool). If it starts rising, the warmer isn't keeping pace, or the 25 sessions/s limit is being hit. A CloudWatch alarm on the p99 statistic of session_allocation_latency above 500ms should signal a capacity problem.
3. Token Budget Exhaustion Rate: percentage of interactions that hit the configured per-session token limit. In financial systems, an agent exhausting its token budget is likely in a loop or received an adversarial prompt. This signal should trigger both an operational alarm and a security event — correlated to sessionId and customerId for investigation.
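The first signal can be sketched as a payload builder for the dispatcher. The PoolId and Region dimensions and the 80% threshold come from the description above; the namespace, the metric name, and both function names are hypothetical. The dictionary follows the shape of CloudWatch PutMetricData arguments, but nothing here calls AWS.

```python
def pool_utilization_metric(pool_id: str, region: str, in_use: int, target: int) -> dict:
    """Build keyword arguments for CloudWatch PutMetricData with the pool utilization ratio."""
    if target <= 0:
        raise ValueError("target must be positive")
    return {
        "Namespace": "AgentPool",  # hypothetical namespace
        "MetricData": [{
            "MetricName": "PoolUtilization",  # hypothetical metric name
            "Dimensions": [
                {"Name": "PoolId", "Value": pool_id},
                {"Name": "Region", "Value": region},
            ],
            "Value": round(100.0 * in_use / target, 2),
            "Unit": "Percent",
        }],
    }


def should_alarm(in_use: int, target: int, threshold_pct: float = 80.0) -> bool:
    """True once utilization reaches the alarm threshold."""
    return 100.0 * in_use / target >= threshold_pct


metric = pool_utilization_metric("pool-a", "us-east-1", in_use=160, target=200)
```

In a real deployment the threshold check would live in a CloudWatch alarm rather than in application code; should_alarm is only here to make the 80% rule explicit.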
For distributed correlation, I inject the original request's correlationId as an AgentCore session attribute and as a field in all Firehose events. This allows tracing a user interaction from API Gateway to the S3 audit log, through all model invocations — essential for compliance investigations.
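A sketch of what one audit event might look like follows. The fields mirror what the audit design records (input, output, tools invoked, tokens consumed) plus the correlationId; the exact field names and the function are assumptions. Records are newline-delimited JSON so that objects delivered to S3 can be split line by line.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def audit_record(correlation_id: str, session_id: str, customer_id: str,
                 user_input: str, output: str, tools: list, tokens: int,
                 at: Optional[datetime] = None) -> bytes:
    """Serialize one agent interaction as a newline-terminated JSON record."""
    timestamp = (at or datetime.now(timezone.utc)).isoformat()
    event = {
        "correlationId": correlation_id,
        "sessionId": session_id,
        "customerId": customer_id,
        "timestamp": timestamp,
        "input": user_input,
        "output": output,
        "toolsInvoked": tools,
        "tokensConsumed": tokens,
    }
    return (json.dumps(event, sort_keys=True) + "\n").encode("utf-8")


record = audit_record("corr-1", "s-1", "cust-9", "balance?", "Your balance is ...",
                      ["get_balance"], 42, at=datetime(2026, 7, 1, tzinfo=timezone.utc))
```

The returned bytes are what would go into the Data field of a Firehose record. Keeping correlationId at the top level makes it straightforward to filter the S3 logs by the same identifier that API Gateway logged.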
Consequences: What Can Go Wrong with Sessions at Scale
Orphaned sessions silently accumulate cost. If the dispatcher fails mid-request, a session can be left IN_USE with no owner, or end up back in the pool as AVAILABLE while still carrying a previous user's context. Implement a reconciliation Lambda that scans for sessions in IN_USE status whose last_activity is more than 15 minutes old, drains them, and creates replacements. Without this, you'll accumulate sessions with stale context causing incorrect responses and unnecessary charges.
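The selection rule for the reconciliation Lambda is simple enough to sketch as a pure function. The item shape (sessionId, status, lastActivity as epoch seconds) and the function name are assumptions; draining and replacing the sessions is left to the caller.

```python
def find_orphans(sessions: list, now: int, max_idle_seconds: int = 15 * 60) -> list:
    """Return the ids of IN_USE sessions with no activity for longer than max_idle_seconds."""
    return [
        s["sessionId"]
        for s in sessions
        if s["status"] == "IN_USE" and now - s["lastActivity"] > max_idle_seconds
    ]


now = 1_700_000_000
pool = [
    {"sessionId": "s-1", "status": "IN_USE", "lastActivity": now - 20 * 60},    # stale
    {"sessionId": "s-2", "status": "IN_USE", "lastActivity": now - 5 * 60},     # active
    {"sessionId": "s-3", "status": "AVAILABLE", "lastActivity": now - 60 * 60}, # idle but not claimed
]
orphans = find_orphans(pool, now)
```

Only s-1 is selected here. Draining should use the same conditional-update discipline as the dispatcher, so a session that resumes activity between the scan and the drain is not killed mid-interaction.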
The 25 sessions/s limit is shared by everything in the AWS account; it is not allocated per workload or per environment. If you operate multiple environments (dev, staging, prod) in the same account in us-east-1, they all share this limit. Separate environments into distinct AWS accounts with AWS Organizations; this is not optional in regulated financial environments.
Session creation bursts can mask an attack. A sudden spike in session creation rate can be legitimate traffic or a resource exhaustion attempt. Configure a WAF rate-based rule on API Gateway limiting session creation requests to the equivalent of 10 per IP per second (WAF rate-based rules count requests over an evaluation window of at least one minute, so this is expressed as 600 requests per 60-second window), and enable AWS Shield Advanced for detection of anomalous traffic patterns. The cost of an AgentCore session created by an attacker is your cost, not theirs.
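Here is a sketch of the rate rule as a Python dictionary mirroring the WAFv2 rule JSON shape. AWS WAF rate-based rules count requests over an evaluation window (to my knowledge 60, 120, 300, or 600 seconds), so a per-second target has to be multiplied by the window length. The rule name, the /sessions path prefix, and the function are hypothetical; verify the field names against the current WAFv2 API reference before use.

```python
ALLOWED_WINDOWS = (60, 120, 300, 600)


def session_creation_rate_rule(path_prefix: str = "/sessions", per_second: int = 10,
                               window_seconds: int = 60) -> dict:
    """Build a rate-based rule that blocks IPs creating sessions too quickly."""
    if window_seconds not in ALLOWED_WINDOWS:
        raise ValueError(f"window_seconds must be one of {ALLOWED_WINDOWS}")
    name = "LimitSessionCreationPerIp"  # hypothetical rule name
    return {
        "Name": name,
        "Priority": 1,
        "Statement": {
            "RateBasedStatement": {
                "Limit": per_second * window_seconds,
                "EvaluationWindowSec": window_seconds,
                "AggregateKeyType": "IP",
                # Count only requests that hit the session creation path.
                "ScopeDownStatement": {
                    "ByteMatchStatement": {
                        "SearchString": path_prefix,
                        "FieldToMatch": {"UriPath": {}},
                        "TextTransformations": [{"Priority": 0, "Type": "NONE"}],
                        "PositionalConstraint": "STARTS_WITH",
                    }
                },
            }
        },
        "Action": {"Block": {}},
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": name,
        },
    }


rule = session_creation_rate_rule()
```

A one-minute window means a short spike above 10/s can pass before the rule reacts, so this rule limits sustained abuse; the pool and the utilization alarm remain the defense against sub-minute bursts.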
Regulatory and Governance Implications for Financial Markets
The AgentCore quota increase arrives at a moment when global financial regulators — including Brazil's Central Bank with its AI guidelines for financial institutions, and DORA in Europe — are formalizing requirements for AI systems in production. Scaling agents to 5,000 concurrent sessions without a corresponding governance strategy is a regulatory risk, not just a technical one.
The three governance requirements every AgentCore deployment in a financial environment must address are: decision explainability, granular access control, and adversarial testability.
For explainability, recording each agent interaction in Firehose/S3 is not sufficient — you need to record why the agent made each decision: which tools it considered, which it discarded, what the intermediate reasoning was. This requires configuring AgentCore with enableTrace: true and capturing trace events, not just the final output. In a BACEN audit, 'the model decided' is not an acceptable answer.
For access control, each AgentCore session must be associated with a verified end-user identity — not just the application's IAM role. This means propagating the sub from the user's JWT token as a session attribute and using that attribute in tool authorization policies within the agent. An agent that can invoke any tool for any user is a violation of least-privilege at the runtime level.
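A minimal sketch of the tool authorization check follows, assuming the verified JWT sub has already been stored under a "sub" key in the session attributes. The grants mapping, the tool names, and the function are hypothetical; a real system would load grants from a policy store rather than an in-memory dictionary.

```python
def authorize_tool(session_attrs: dict, tool: str, grants: dict) -> bool:
    """Allow a tool call only if the session's verified end user is granted that tool."""
    sub = session_attrs.get("sub")
    if not sub:
        return False  # no verified identity on the session: deny by default
    return tool in grants.get(sub, set())


grants = {
    "user-123": {"get_balance", "list_transactions"},
    "user-456": {"get_balance"},
}

allowed = authorize_tool({"sub": "user-456"}, "get_balance", grants)
denied = authorize_tool({"sub": "user-456"}, "list_transactions", grants)
```

The check runs per tool invocation, not once per session, so a change in a user's grants takes effect on the next call rather than at the next login.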
For adversarial testability, with 200 interactions/s available, you have capacity to run automated red team tests in staging without impacting production. Implement a CI/CD pipeline that executes a known set of adversarial prompts against each new agent instruction version before deployment — and block the deploy if the inadequate response rate exceeds a defined threshold.
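The deploy gate itself reduces to a threshold check over the red team results. In this sketch, each result is True when the agent's response to an adversarial prompt was judged acceptable; how that judgment is made, the 2% default threshold, and the function name are all assumptions for illustration.

```python
def red_team_gate(results: list, max_failure_rate: float = 0.02) -> bool:
    """Return True if the deploy may proceed.

    results[i] is True when the response to adversarial prompt i was acceptable.
    An empty result set blocks the deploy: no evidence is not a pass.
    """
    if not results:
        return False
    failures = results.count(False)
    return failures / len(results) <= max_failure_rate


passing_run = [True] * 99 + [False]      # 1% inadequate responses
failing_run = [True] * 95 + [False] * 5  # 5% inadequate responses
```

In a pipeline, the step would exit non-zero when red_team_gate returns False, which blocks promotion of the new instruction version.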
AgentCore at Scale: Well-Architected Lenses
Security
Propagate end-user identity as AgentCore session attribute. Use KMS CMK with restrictive key policy for audit logs. Configure WAF rate limiting on session creation. Separate environments into distinct AWS accounts.
Reliability
Implement session pool pre-warming to absorb bursts without hitting the 25 sessions/s limit. Periodic orphaned session reconciliation. Per-session circuit breaker to prevent agent loops. DynamoDB for pool control, which replicates across Availability Zones within a region by default.
Performance efficiency
Pool utilization ratio as primary SLO (target < 80%). Alarm on p99 session allocation latency > 500ms. Per-session token budget to prevent long-running interactions blocking capacity.
Common Anti-Patterns When Scaling AgentCore
- Creating a new AgentCore session per HTTP request without a pool — exhausts the 25 sessions/s limit under any moderate burst and adds unnecessary session creation latency to P99.
- Using the same AWS account for dev, staging, and prod with AgentCore — all environments share runtime quotas, and a staging load test can degrade production.
- Not configuring a per-session token budget — an agent in a loop or with an adversarial prompt can consume model budget in minutes, with no alarm until the billing cycle ends.
- Logging only the agent's final output without trace events — in a regulatory audit, you cannot reconstruct the agent's reasoning and 'the model decided' is not an acceptable answer.
- Not implementing session draining logic during new instruction version deploys — active sessions with the previous version continue responding with outdated behavior during the rollout.
In practice, what concerns me about this announcement isn't the quota increase itself. It's that it will encourage teams to put agents into production without having solved the session governance problem first. I've seen this happen with Lambda concurrency and with Kinesis shards: capacity grows before operational maturity, and the resulting cost and data incidents are far more expensive than the time it would have taken to design the admission control correctly. My concrete recommendation: before using more than 500 concurrent sessions in production, implement the pool with DynamoDB, the utilization alarm at 80%, and trace logging in Firehose, in that order, without skipping steps. The hard-won lesson is that in financial systems, the ability to scale is a risk until you have the observability to understand what is happening at that scale.
Verdict: Use the Capacity, But Govern Before You Scale
The AgentCore default quota increase is a genuinely significant change for those building agentic systems in production — it eliminates a category of operational friction and makes AgentCore viable as a primary runtime for financial-scale workloads. But the correct architectural decision is not 'scale to 5,000 sessions' — it's 'implement admission control, immutable audit, and pool observability before using more than 10% of that capacity'. The 25 new sessions/s limit is the real bottleneck; session pre-warming with DynamoDB resolves it. Separating environments into distinct AWS accounts is non-negotiable in regulated contexts. And trace logging with enableTrace: true is what separates an auditable system from a system that merely works. For financial teams: this is the moment to build the session governance foundation — not to simply increase the pool target.