Playbook: A Production AI Agent on AWS in 7 Steps
Ninety percent of AI agents die in the notebook. This playbook covers the 7 steps that separate a working prototype from a reliable production agent on AWS — from trigger to guardrail, covering loop topology, least-privilege tools, external verifier, and execution budget. Opinionated, concrete, and grounded in the real primitives of Amazon Bedrock AgentCore.
Building an agent that works in a notebook is easy. Putting that agent into production — with controlled cost, traceability, security, and guaranteed termination — is where most teams stop. This playbook is the map of the 7 steps you need to walk through, in the right order, with the right decisions.
TL;DR — What you'll be able to decide after reading this
Playbook Context
- Type: Prescriptive playbook, company-agnostic, AWS-based
- Primary platform: Amazon Bedrock AgentCore (Runtime, Gateway, Memory, Identity, Observability)
- Models covered: Any model via Bedrock (Claude, Titan, Llama, etc.)
- Reference pattern: Anthropic 'Building Effective Agents' + AgentCore primitives
- Where 90% stops: The notebook-to-production transition (no budget, no verifier, no observability)
- Minimum prerequisite: AWS account, Bedrock access enabled, basic IAM configured
The mental model that unlocks everything: the agent is a loop with a budget
The difference between a notebook agent and a production agent is not the model — it's the structure around the loop. An agent, at its core, is a cycle: the model observes the current state, decides on an action, executes via tool, observes the result, and decides whether to terminate or iterate. This cycle can run once or a thousand times. In production, you need to know exactly how many times it will run, how much it will cost, and how you'll know it terminated correctly.
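To make the cycle concrete, here is a minimal sketch of a budgeted loop in plain Python. Every name in it (`run_agent_loop`, `decide`, `act`, `is_done`, `LoopResult`) is an illustrative assumption rather than an AgentCore API: `decide` stands in for the model call and `act` for a tool call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopResult:
    state: str
    iterations: int
    terminated_cleanly: bool  # True if the stop condition fired, False if the budget ran out


def run_agent_loop(
    initial_state: str,
    decide: Callable[[str], str],    # hypothetical stand-in for the model call
    act: Callable[[str, str], str],  # hypothetical stand-in for tool execution
    is_done: Callable[[str], bool],  # stop condition that is checkable in plain code
    max_iterations: int = 5,
) -> LoopResult:
    state = initial_state
    for i in range(1, max_iterations + 1):
        action = decide(state)      # observe the state, decide on an action
        state = act(state, action)  # execute the action, observe the result
        if is_done(state):          # terminate or iterate
            return LoopResult(state, i, True)
    # Budget exhausted: report it explicitly instead of looping on.
    return LoopResult(state, max_iterations, False)


# Toy usage: each "tool call" appends one character; done at three characters.
result = run_agent_loop(
    initial_state="",
    decide=lambda s: "append",
    act=lambda s, a: s + "x",
    is_done=lambda s: len(s) >= 3,
    max_iterations=5,
)
print(result)
```

The caller always learns how many iterations ran and whether the loop ended because the work was done or because the budget ran out. Those are the questions a production loop has to answer.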
Anthropic describes this precisely in 'Building Effective Agents': most successful agentic systems are not complex autonomous agents — they are sets of LLM-augmented workflows where autonomy is applied surgically. The most common mistake is giving the agent more autonomy than the problem requires. A linear pipeline with one tool call solves 70% of cases. The reflection loop solves another 20%. Multi-agent with orchestration is for the remaining 10% — and it charges a high price in complexity and cost.
Amazon Bedrock AgentCore materializes this mental model in concrete primitives: the Runtime executes the loop; the Gateway manages tool access with protocol control (MCP); Memory persists state between sessions; Identity manages agent credentials and permissions; Observability traces each iteration. When you structure your agent around these five primitives, you're forcing the right questions before writing code.
Agent Loop Topologies — When to Use Each One
| Topology | Structure | Relative cost | Latency | Ops complexity | When to use |
|---|---|---|---|---|---|
| Linear (pipeline) | Prompt → Tool → Response (1 iteration) | $ (minimum) | Low | Minimal | Well-defined task, predictable output |
| Reflection | Generate → Critique → Refine (N iterations) | $$ (2–4x linear) | Medium | Low | Output quality matters more than speed |
| Plan-Execute | Plan → Sub-tasks → Sequential execution | $$$ (3–6x linear) | High | Medium | Long tasks with inter-step dependencies |
| Multi-agent | Orchestrator + N specialized agents | $$$$ (highest) | High / variable | High | Real parallelism; security-isolated domains |
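
The relative-cost column can be turned into a rough per-invocation estimate. The sketch below uses only the multipliers stated in the table; the table gives no number for multi-agent, so that topology is deliberately left out. The function name and the $0.01 baseline are assumptions for illustration.

```python
from typing import Dict, Tuple

# (low, high) multipliers relative to a linear pipeline, taken from the table above.
# Multi-agent is omitted because the table only says "highest".
COST_MULTIPLIER: Dict[str, Tuple[float, float]] = {
    "linear": (1.0, 1.0),
    "reflection": (2.0, 4.0),
    "plan_execute": (3.0, 6.0),
}


def estimate_cost_range(topology: str, linear_cost_usd: float) -> Tuple[float, float]:
    """Scale a measured linear-pipeline cost by the table's multiplier range."""
    if topology not in COST_MULTIPLIER:
        raise ValueError(f"no multiplier in the table for topology: {topology}")
    low, high = COST_MULTIPLIER[topology]
    return (linear_cost_usd * low, linear_cost_usd * high)


# Hypothetical baseline: a linear run that costs $0.01 per invocation.
low, high = estimate_cost_range("plan_execute", linear_cost_usd=0.01)
print(f"plan-execute: ${low:.2f} to ${high:.2f} per invocation")
# prints "plan-execute: $0.03 to $0.06 per invocation"
```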
Why the external verifier is not optional
The most frequent architectural mistake I see in production agents is the agent evaluating its own output. This is equivalent to asking a developer to review their own PR with no external reviewer — it works sometimes, fails systematically when it matters.
An external verifier is any mechanism that evaluates the agent's output without using the same model with the same context. The practical options are: (1) a second, smaller, cheaper model with a focused evaluation prompt; (2) a deterministic Lambda function that validates schema, range, or business invariants; (3) a set of unit tests executed as a tool; (4) a human in the loop for high-risk cases. Bedrock AgentCore natively supports this pattern via configurable Guardrails that intercept outputs before returning to the caller.
The practical rule: if the agent's failure has financial, legal, or security consequences, the verifier must be deterministic (code, not LLM). If the failure is about content quality, a second LLM with an evaluation prompt is sufficient. Never use the same model, the same prompt, and the same context to both generate and verify — you're just amplifying the original bias.
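As an illustration of the deterministic option, here is a minimal sketch of a verifier for a hypothetical refund agent. The field names, the currency allowlist, and the function name `verify_refund` are all assumptions; the point is that the checks are plain code with no model call.

```python
from typing import Any, Dict, List

ALLOWED_CURRENCIES = {"USD", "EUR", "BRL"}  # hypothetical business allowlist

# Hypothetical output schema: field name -> accepted Python types.
SCHEMA = {
    "order_id": (str,),
    "currency": (str,),
    "refund_amount": (int, float),
    "order_total": (int, float),
}


def verify_refund(output: Dict[str, Any]) -> List[str]:
    """Return a list of violations; an empty list means the output passed."""
    violations: List[str] = []

    # 1. Schema: every field present and of the expected type.
    #    bool is rejected explicitly because it is a subclass of int in Python.
    for field, types in SCHEMA.items():
        if field not in output:
            violations.append(f"missing field: {field}")
        elif isinstance(output[field], bool) or not isinstance(output[field], types):
            violations.append(f"wrong type for {field}")
    if violations:
        return violations  # invariants are meaningless on a malformed payload

    # 2. Business invariants.
    if output["currency"] not in ALLOWED_CURRENCIES:
        violations.append("currency not allowed")
    if output["refund_amount"] <= 0:
        violations.append("refund must be positive")
    if output["refund_amount"] > output["order_total"]:
        violations.append("refund exceeds order total")
    return violations


good = {"order_id": "A1", "currency": "USD", "refund_amount": 10.0, "order_total": 25.0}
print(verify_refund(good))                             # []
print(verify_refund({**good, "refund_amount": 30.0}))  # ['refund exceeds order total']
```

Because the verifier is a pure function, it can run in a Lambda, as a tool, or in a unit test without changes.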
The 7 Steps: From Notebook to Production
Step 1: Define the TRIGGER with surgical precision
Before any code, answer: what starts this agent? The three classes are (a) Event (S3 PutObject, SQS message, EventBridge rule); (b) Synchronous API (HTTP call via API Gateway with response expected in < 30s); (c) Schedule (EventBridge Scheduler for periodic tasks). Document the exact trigger payload, the expected response SLA, and whether the caller expects a synchronous response or accepts async with callback. This determines whether you use Lambda (< 15 min), ECS Fargate (long duration), or AgentCore Runtime directly. Testable: can you trigger the agent manually with a synthetic payload and verify it starts? If not, the trigger is not defined.
Step 2: Choose the loop TOPOLOGY (use the comparison table)
Use the comparison table above. The golden rule: always start with the simplest topology that solves the problem. If linear solves it, don't implement reflection. If a single agent solves it, don't implement multi-agent. Document the chosen topology and the reason; this will be questioned in code review and post-mortems. For AgentCore, the topology determines how you configure the AgentCore Runtime: maximum number of turns, whether memory is enabled (for reflection and plan-execute you need session memory), and whether you'll use sub-agents. Testable: can you draw the loop in a sequence diagram with at most 5 boxes? If not, it's too complex.
Step 3: Define TOOLS with least privilege
Every tool the agent can call is an attack surface. Use the AgentCore Gateway with MCP protocol to register tools with explicit scope. For each tool, define: (1) Dedicated IAM role for the agent: never AdministratorAccess, never PowerUser; (2) Explicit allowlist of permitted actions (e.g., `s3:GetObject` on a specific bucket, not `s3:*`); (3) Validated input/output schema: AgentCore Gateway validates the schema before executing; (4) Per-tool timeout: define it in the Gateway, don't rely on the downstream Lambda timeout. Practical rule: if you can't list in 3 lines what the tool does and what it cannot do, the scope is too broad. Testable: run `aws iam simulate-principal-policy` with the agent's role and confirm that actions outside the allowlist are denied.
Step 4: Implement an external VERIFIER
The verifier evaluates the agent's output before it's delivered to the caller or persisted. Implement as: (a) Deterministic Lambda for schema validation and business invariants (cost: cents, latency: < 100ms); (b) Second Bedrock model with focused evaluation prompt for content quality (use a smaller model such as Claude Haiku); (c) AgentCore Guardrail for content filtering, PII, and prohibited topics, configured via console or IaC (CDK/Terraform). The verifier must return: `PASS`, `FAIL_HARD` (stop the loop, return error to caller), or `FAIL_SOFT` (iterate again with feedback). Document which failure class triggers each response. Testable: inject a deliberately incorrect output and confirm the verifier rejects it with the correct error code.
Step 5: Configure STOP RULES and execution budget
An agent without stop rules is a process without `ulimit`: it will consume resources until the infrastructure timeout, which is the worst place to stop. Define explicitly: (1) Maximum iterations: in AgentCore Runtime, configure `maxIterations`; start conservative (5–10) and adjust with data; (2) Total timeout: configure at the Runtime level AND in the Lambda/Fargate hosting it; use the lower of the two; (3) Token ceiling: estimate cost per iteration, multiply by maximum iterations, define a ceiling in input+output tokens; AgentCore exposes per-invocation token metrics; (4) Cost ceiling: use AWS Budgets with alert at 80% of expected monthly ceiling for the agent; (5) Code-verifiable stop condition: the termination condition must be testable without calling the LLM.
Step 6: Implement loop OBSERVABILITY
You cannot operate what you cannot observe. For agents, observability has an extra dimension: you need visibility per iteration, not just per invocation. AgentCore Observability emits structured traces with: iteration number, tokens consumed, tool called, tool latency, model decision (continue/stop), and verifier result. Configure: (1) CloudWatch Logs with dedicated log group per agent, minimum 90-day retention for audit; (2) X-Ray trace enabled in AgentCore Runtime, so each loop iteration appears as a child span; (3) Custom metrics via CloudWatch EMF: `agent.iterations_per_invocation`, `agent.tokens_per_invocation`, `agent.tool_call_latency_ms`, `agent.verifier_fail_rate`; (4) CloudWatch Dashboard with: iteration distribution, estimated cost per invocation (tokens ×
Step 7: Configure GUARDRAILS and audit trail
Guardrails are not optional in production: they are the contract between the agent and the business. In AgentCore, configure: (1) Content filtering: define prohibited categories (violence, PII, unauthorized financial data) with per-category threshold; (2) Topic denial: explicit list of topics the agent must not address, regardless of user prompt; (3) Grounding check: for RAG agents, validate that the response is anchored in the retrieved sources (AgentCore supports this natively); (4) Sensitive information redaction: configure automatic PII redaction before logging (LGPD/GDPR compliance); (5) Immutable audit trail: use S3 with Object Lock (WORM) to store agent decision logs; configure CloudTrail to capture all AgentCore API calls.
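Steps 4 and 5 can be combined into a single decision function that needs no LLM call. The sketch below is one possible shape, not an AgentCore feature: `Budget`, `Usage`, `token_ceiling`, `next_step`, and the returned action strings are hypothetical names, and the numbers in the usage example are made up.

```python
import enum
from dataclasses import dataclass


class Verdict(enum.Enum):
    PASS = "PASS"            # deliver the output
    FAIL_HARD = "FAIL_HARD"  # stop the loop, return error to caller
    FAIL_SOFT = "FAIL_SOFT"  # iterate again with feedback


@dataclass(frozen=True)
class Budget:
    max_iterations: int
    max_tokens: int    # ceiling on input + output tokens for the whole invocation
    timeout_s: float   # the lower of the runtime and hosting timeouts


@dataclass(frozen=True)
class Usage:
    iterations: int
    tokens: int
    elapsed_s: float


def token_ceiling(tokens_per_iteration: int, max_iterations: int) -> int:
    """Token ceiling as estimated tokens per iteration times maximum iterations."""
    return tokens_per_iteration * max_iterations


def budget_exhausted(budget: Budget, usage: Usage) -> bool:
    return (
        usage.iterations >= budget.max_iterations
        or usage.tokens >= budget.max_tokens
        or usage.elapsed_s >= budget.timeout_s
    )


def next_step(verdict: Verdict, budget: Budget, usage: Usage) -> str:
    """Map a verifier verdict plus current usage to one loop action."""
    if verdict is Verdict.PASS:
        return "return_result"
    if verdict is Verdict.FAIL_HARD:
        return "return_error"
    # FAIL_SOFT: retry only while budget remains.
    return "stop_budget_exhausted" if budget_exhausted(budget, usage) else "iterate"


# Hypothetical numbers: 4,000 tokens per iteration, 8 iterations, and the
# lower of a 120 s runtime timeout and a 60 s hosting timeout.
budget = Budget(max_iterations=8, max_tokens=token_ceiling(4000, 8), timeout_s=min(120.0, 60.0))
print(next_step(Verdict.FAIL_SOFT, budget, Usage(iterations=3, tokens=9000, elapsed_s=10.0)))  # iterate
print(next_step(Verdict.FAIL_SOFT, budget, Usage(iterations=8, tokens=9000, elapsed_s=10.0)))  # stop_budget_exhausted
```

Keeping this logic in a pure function is what makes the stop condition testable without calling the model.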
What AgentCore solves that you don't want to build from scratch
Before AgentCore, assembling the infrastructure around a Bedrock agent required: a manual session management layer (DynamoDB + context logic), tool integration via Lambda with custom error handling, hand-assembled observability with X-Ray SDK, and agent credential management via Secrets Manager with manual rotation. Each of these pieces is individually solvable — the problem is that they need to work together coherently under load, with partial failures, and with cross-layer traceability.
AgentCore offers five managed primitives that eliminate this platform work: Runtime (loop execution with turn control, integrated session memory, and native Bedrock model integration); Gateway (tool proxy with MCP support, schema validation, and per-tool access control); Memory (session memory and long-term memory with automatic vectorization); Identity (OAuth 2.0 managed credentials for the agent, no hardcoded secrets); Observability (structured per-iteration traces, token metrics, and CloudWatch and X-Ray integration).
The honest trade-off: you gain implementation speed and operational consistency, but accept AWS coupling and the configuration limits of the managed service. For small teams or deadline-driven projects, AgentCore is the right choice. For large-scale agent platforms with very specific customization requirements, it may be worth building on lower-level Bedrock primitives. But that's a 1% problem for most teams.
Reference Architecture: Production Agent with AgentCore
Complete flow of one invocation: from external trigger to audited result, passing through the loop with verifier and stop rules. The five AgentCore primitives are highlighted as the managed zone.
- API Gateway · REST / WebSocket
- EventBridge · Event / Schedule
- SQS · Async Queue
- AgentCore Runtime · maxIterations / timeout / turns
- AgentCore Memory · Session + Long-term
- AgentCore Identity · OAuth2 / IAM Role
- AgentCore Gateway · MCP / Schema Validation
- AgentCore Observability · Trace / Tokens / Latency
- Bedrock Model · Claude / Titan / Llama
- Stop Rule Check · code-verifiable condition
- Tool Lambda · IAM scoped / allowlist
- External API · (via Gateway proxy)
- Deterministic Verifier · Lambda / schema / invariants
- LLM Verifier · Smaller model / eval prompt
- AgentCore Guardrail · PII / topics / grounding
- CloudWatch Logs · 90d retention / per-agent group
- X-Ray Traces · per-iteration spans
- S3 Object Lock · WORM audit trail
- CloudWatch Dashboard · cost / iterations / p99
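
The CloudWatch components in this flow depend on the loop emitting one structured record per iteration. A minimal sketch using the CloudWatch Embedded Metric Format (EMF) follows. The `_aws` envelope follows the EMF specification; the namespace `AgentPlaybook`, the dimension `AgentName`, the metric name `agent.tokens_per_iteration`, and the function name are assumptions chosen for illustration, while `agent.tool_call_latency_ms` is the metric name used in Step 6.

```python
import json
import time
from typing import Optional


def emf_iteration_record(
    agent_name: str,
    iteration: int,
    tokens: int,
    tool: str,
    tool_latency_ms: float,
    timestamp_ms: Optional[int] = None,
) -> str:
    """Build one EMF log line describing a single loop iteration."""
    record = {
        "_aws": {
            "Timestamp": timestamp_ms if timestamp_ms is not None else int(time.time() * 1000),
            "CloudWatchMetrics": [
                {
                    "Namespace": "AgentPlaybook",     # hypothetical namespace
                    "Dimensions": [["AgentName"]],
                    "Metrics": [
                        {"Name": "agent.tokens_per_iteration", "Unit": "Count"},
                        {"Name": "agent.tool_call_latency_ms", "Unit": "Milliseconds"},
                    ],
                }
            ],
        },
        "AgentName": agent_name,
        # Plain properties: searchable in logs, but not metric dimensions,
        # which keeps metric cardinality low.
        "iteration": iteration,
        "tool": tool,
        "agent.tokens_per_iteration": tokens,
        "agent.tool_call_latency_ms": tool_latency_ms,
    }
    return json.dumps(record)


# Writing the line to stdout is enough on Lambda, where stdout goes to CloudWatch Logs.
print(emf_iteration_record("invoice-agent", 5, 1840, "lookup_invoice", 212.5))
```

Keeping `iteration` as a property rather than a dimension is a design choice: the log line still answers what happened at iteration 5, without creating a separate metric per iteration number.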
Anti-patterns that will bite you in production
1. Loop without budget: Agent configured without maxIterations and without explicit timeout. In production, an ambiguous prompt can keep the agent iterating until the hosting timeout (15 min on Lambda, or whatever limit the Fargate task sets), generating token cost with no useful result and no alarm. Always define maxIterations in the Runtime and an independent timeout in the hosting infrastructure.
2. Verifier = the agent itself: Using the same model with the same context to both generate and evaluate the output is the most common anti-pattern. The model will confirm its own output almost always — you've created a self-validation system with systematic positive bias. Use a second smaller model, a deterministic Lambda, or a configured guardrail.
3. Tool with excessive privilege: A tool with `s3:*` or `dynamodb:*` turns the agent into an attack vector. If the model is induced (prompt injection via external data) to call the tool with malicious parameters, the damage is proportional to the privilege. Apply IAM with specific resource ARN and condition keys. Use AgentCore Gateway to validate input schema before executing.
4. Observability only at invocation level: Logging only the final input and output hides what happened inside the loop. When an agent fails after 8 iterations, you need to know what happened at iteration 5. Configure per-iteration traces from the start — retrofitting is much more expensive.
5. Multi-agent as first choice: Multi-agent multiplies debugging complexity, token cost, and failure surface. Start simple. Promote to multi-agent only when a single agent demonstrably doesn't solve it.
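
For the excessive-privilege anti-pattern, the sketch below shows what a scoped policy and a cheap pre-deployment check could look like. The policy structure is standard IAM JSON; the bucket name, prefix, `Sid`, and both function names are hypothetical, and the wildcard check is a simple lint, not a substitute for `aws iam simulate-principal-policy`.

```python
import json
from typing import List


def read_only_bucket_policy(bucket: str, prefix: str) -> dict:
    """IAM policy allowing only s3:GetObject on one prefix of one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadAgentInputsOnly",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
            }
        ],
    }


def wildcard_actions(policy: dict) -> List[str]:
    """Return every allowed action that is '*' or ends in ':*'."""
    found: List[str] = []
    for statement in policy["Statement"]:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # IAM accepts a single string or a list
            actions = [actions]
        found.extend(a for a in actions if a == "*" or a.endswith(":*"))
    return found


policy = read_only_bucket_policy("agent-inputs-example", "invoices/")
print(json.dumps(policy, indent=2))
print(wildcard_actions(policy))  # []

too_broad = {"Version": "2012-10-17",
             "Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}]}
print(wildcard_actions(too_broad))  # ['s3:*']
```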
Rule of Thumb
If the agent has no budget, no external verifier, and no per-iteration trace, it's not in production — it's in staging with real traffic. Or shorter: no stop rule + no verifier + no per-iteration trace = prototype, not product.
After working with financial systems where an incorrect action has real and immediate consequences, I developed a reflex: I never trust an agent I can't audit iteration by iteration. It's not paranoia; it's the same discipline we apply to any system that makes decisions with side effects.
In practice, when I start a new agent, I ask three questions before writing any code: (1) What happens if the agent iterates 10x more than expected? (2) Who validates that the output is correct, and is that validator independent of the generator? (3) If this agent is compromised by prompt injection, what is the maximum blast radius with the permissions it has? If I can't answer all three in 5 minutes, the design isn't mature enough to start implementing.
On AgentCore specifically: it's genuinely useful for teams that don't want to assemble the platform layer from scratch. The Gateway with MCP is the piece that impressed me most: solving the tool registration and validation problem in a standardized way is something I saw teams reimplementing differently in every project. The AWS coupling trade-off is real, but for most enterprise use cases, we're already deeply coupled anyway.
What I don't do: I don't use multi-agent until I have concrete evidence that a single agent doesn't solve it. The debugging complexity of multi-agent systems in production is substantially higher, and most problems I see being solved with multi-agent could be solved with a single agent and well-defined tools.
Verdict
Most AI agents don't fail because the model is bad — they fail because the infrastructure around the loop was not designed. A production agent is a distributed system with a non-deterministic component at its center: it needs a budget, external verification, per-iteration observability, and guardrails before touching real traffic. Amazon Bedrock AgentCore provides the right primitives to assemble this structure without reinventing platform. The 7 steps in this playbook are the minimum sequence to go from notebook to production responsibly. Skip any one of them and you're operating a prototype with real traffic — and the difference will show up at the worst possible moment.