Playbook: From Prompt to Pipeline — the 5 Stages of a Reliable Agent
An agent is not born reliable — it is built stage by stage, from a simple prompt to an operable system with observability, guardrails, and audit trails. This playbook maps each stage, what you gain, what you risk, and when to stop climbing. The bottleneck is not the prompt; it is the system around it.
Everyone starts with a prompt. Few end up with an agent that runs in production without blowing the budget, looping forever, or taking actions nobody authorized. The problem is not the model — it is the absence of structure around it.
What you will be able to decide after reading this
Base references for this playbook
- ReAct pattern origin: Yao et al., 2022 (Princeton / Google Brain)
- Agent design reference: Anthropic Engineering, Building Effective Agents (2024)
- Runtime platform: Amazon Bedrock AgentCore (GA 2025)
- Application domain: agnostic; the pattern applies to any LLM with tool use
- Playbook scope: Stages 1–5, from simple prompt to operable pipeline
- Estimated cost of skipping Stage 5: uncontrolled loops can generate hundreds of API calls in minutes (estimate based on public cases)
The mental model that unlocks everything: an agent is a system, not a call
Most teams approach the LLM with a REST API mindset: you send an input, you get an output, done. That works for Stage 1. The problem is that this mindset persists when the system gains tools, loops, and autonomy — and that is when accidents happen.
The conceptual shift is simple: the moment the LLM can act in the world — call an API, write to a database, send an email — you no longer have a language model. You have an agent. And agents are distributed systems with state, side effects, and non-deterministic failures.
That changes everything. A distributed system needs:
- Execution limits (stop rules, token budgets, max iterations)
- Observability (traces, spans, cost metrics per run)
- Privilege control (the agent can only do what it needs to do — not everything it is capable of doing)
- External verification (the fact that the LLM said the answer is correct does not mean it is)
- Audit trails (who authorized, what was executed, what was the output)
Yao et al.'s ReAct paper (2022) formalized the Observe→Reason→Act loop as the core of an agent. What the paper does not solve — and no paper solves — is what happens when that loop runs in production with real data, real permissions, and real users. That is the engineering work that Stages 3, 4, and 5 cover.
Anthropic's effective agents guide makes the same point in substance: the added complexity of agentic workflows is only worthwhile when the task requires flexibility or multi-step reasoning that fixed workflows cannot deliver. Translation: do not climb the ladder out of technical vanity. Climb because the problem demands it.
The 5 Stages — what you add, what you gain, what you risk
- 1
Stage 1 — Simple Prompt
What it is: A single LLM call. Input → Output. No tools, no loop, no memory.
What you add: A well-structured prompt (system prompt, examples, clear format instruction).
What you gain: Maximum iteration speed. Predictable cost. Zero side effects. Ideal for classification, summarization, entity extraction, draft generation.
What you risk: Nothing beyond a bad response. The risk is quality, not system integrity.
When to stop here: If the problem fits in one call with enough context, stop here. Seriously. Most enterprise use cases fit this stage with careful prompt engineering.
Testability: Direct unit test: fixed input → output evaluated by a deterministic criterion or by another LLM as judge.
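The testability point for Stage 1 can be sketched as a plain unit test. This is a minimal illustration, not a prescribed setup: `classify_ticket`, `fake_llm`, and the label set are hypothetical names introduced here, and the stub stands in for whatever model client you actually use.

```python
import json
from typing import Callable

# Illustrative label set; not from the playbook.
ALLOWED_LABELS = {"billing", "technical", "other"}


def classify_ticket(text: str, call_llm: Callable[[str], str]) -> str:
    """Stage 1: one prompt in, one answer out. No tools, no loop, no memory."""
    prompt = (
        "Classify the support ticket as one of: billing, technical, other.\n"
        'Reply with JSON like {"label": "billing"}.\n\n'
        f"Ticket: {text}"
    )
    raw = call_llm(prompt)
    label = json.loads(raw)["label"]
    # Deterministic criterion: the label must come from the closed set.
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return label


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model call, so the test is repeatable.
    if "invoice" in prompt.lower():
        return '{"label": "billing"}'
    return '{"label": "other"}'


def test_invoice_ticket_is_billing() -> None:
    assert classify_ticket("My invoice is wrong", fake_llm) == "billing"
```

Injecting the model call as a parameter keeps the test independent of any provider SDK; the same function can be pointed at a real client in an evaluation run.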
- 2
Stage 2 — Prompt + Tools (Tool Use)
What it is: The LLM can call external functions (APIs, searches, calculators, databases). This is where the agent is born.
What you add: Tool definitions (JSON schemas), tool executor, tool call error handling.
What you gain: The model now acts in the world. Real-time queries, code execution, integration with external systems.
What you risk: Privilege escalation. The agent has the permissions of the identity executing the tools. If that identity has write access in production, so does the agent. The least-privilege principle is mandatory here, not optional.
Practical rule: Each tool should have a separate IAM role with minimum scope. In Bedrock AgentCore, use resource-based policies per tool. Never pass admin credentials to the tool executor.
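The three additions named for Stage 2 (tool definitions, a tool executor, and tool call error handling) can be sketched in a few lines. This is an assumption-laden illustration: `Tool`, `REGISTRY`, `get_order_status`, and `execute_tool_call` are hypothetical names, and the `required_args` mapping is a simplified stand-in for a real JSON schema.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Tool:
    name: str
    required_args: dict[str, type]  # simplified stand-in for a JSON schema
    handler: Callable[..., Any]


def get_order_status(order_id: str) -> dict:
    # Hypothetical read-only tool; a real one would call an API under its own scoped role.
    return {"order_id": order_id, "status": "shipped"}


REGISTRY = {
    "get_order_status": Tool("get_order_status", {"order_id": str}, get_order_status),
}


def execute_tool_call(name: str, args: dict[str, Any]) -> dict:
    """Run one tool call and always return a structured result the model can read."""
    tool = REGISTRY.get(name)
    if tool is None:
        return {"ok": False, "error": f"unknown tool: {name}"}
    extra = sorted(set(args) - set(tool.required_args))
    if extra:
        return {"ok": False, "error": f"unexpected arguments: {extra}"}
    for arg, expected in tool.required_args.items():
        if not isinstance(args.get(arg), expected):
            return {"ok": False, "error": f"argument {arg!r} must be {expected.__name__}"}
    try:
        return {"ok": True, "result": tool.handler(**args)}
    except Exception as exc:
        # Report the failure to the caller instead of crashing the agent.
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
```

Only tools present in the registry can run, so the registry doubles as an allowlist; the permissions each handler runs with still have to be scoped outside this code.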
- 3
Stage 3 — Loop (ReAct: Observe → Reason → Act → Repeat)
What it is: The agent iterates. It observes the result of an action, reasons about the next step, acts again. This is the ReAct pattern from Yao et al. (2022).
What you add: Execution loop, context management between iterations, stop criterion (stop rule).
What you gain: Ability to solve multi-step problems. The agent can correct intermediate errors, explore alternative paths, decompose complex tasks.
What you risk: Loss of predictability. Without explicit stop rules, the agent can loop indefinitely. Without a token budget, cost scales non-linearly. Without an iteration limit, a simple task can turn into a 200-API-call race.
Mandatory stop rules: max_iterations, the maximum number of loop cycles (I recommend starting with 10), alongside the max_tokens_total and timeout limits named in the anti-patterns section.
- 4
Stage 4 — Verifier (External Verification)
What it is: A separate component (another LLM call, a deterministic function, or a rule set) that validates the agent's output before it is delivered or before an irreversible action is executed.
What you add: Independent verifier, explicit validation criteria, approval gate for high-impact actions.
What you gain: The separation between 'the agent believes it is correct' and 'the output passed verifiable criteria'. This is the watershed for production. An agent without a verifier is a system that trusts itself, and LLMs are notoriously confident even when wrong.
What you risk: Additional latency. Additional cost (if the verifier is another LLM). Maintenance complexity for validation criteria.
- 5
Stage 5 — Pipeline (Operable System)
What it is: The agent becomes a system you operate. It has a trigger (event, schedule, webhook), full observability, guardrails, cost budget, audit trail, and incident runbook.
What you add: Trigger mechanism, distributed tracing (spans per iteration and per tool call), cost tracking per run, input/output guardrails, immutable audit log, anomaly alerts, runbook.
What you gain: A system you can debug, monitor, bill, audit, and scale. Without this, you have a prototype in production, not a product.
What you risk: Significant operational complexity. A poorly instrumented pipeline is worse than no pipeline: you have a false sense of control.
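To make "spans per iteration and per tool call" concrete, here is a minimal in-process sketch using only the standard library. `RunTrace` and `Span` are hypothetical names, and the token counts are made-up values; a production system would export spans to a tracing backend rather than keep them in a list.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    run_id: str
    started_at: float
    duration_s: float = 0.0
    attributes: dict = field(default_factory=dict)


class RunTrace:
    """Collects spans for one agent run so the run can be inspected afterwards."""

    def __init__(self) -> None:
        self.run_id = str(uuid.uuid4())
        self.spans: list[Span] = []

    @contextmanager
    def span(self, name: str, **attributes):
        current = Span(name, self.run_id, time.time(), attributes=dict(attributes))
        start = time.perf_counter()
        try:
            yield current
        finally:
            # Record the span even if the wrapped step raised.
            current.duration_s = time.perf_counter() - start
            self.spans.append(current)

    def total_tokens(self) -> int:
        return sum(s.attributes.get("tokens", 0) for s in self.spans)


trace = RunTrace()
with trace.span("iteration", index=1) as iteration:
    with trace.span("tool_call", tool="search") as call:
        call.attributes["tokens"] = 0  # tool calls may not consume model tokens
    iteration.attributes["tokens"] = 420  # illustrative number
```

Inner spans are appended before the spans that enclose them, because a span is recorded when it finishes; every span carries the same `run_id`, which is what lets cost and latency be grouped per run.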
Why Stage 4 is the most underestimated
Teams that reach Stage 3 are usually excited. The loop works. The agent solves multi-step problems. The demos are impressive. And then they jump straight to Stage 5 — instrumentation, pipeline, deploy.
The problem: they skipped Stage 4.
Without an external verifier, the system trusts the LLM's self-assessment. And LLMs are optimistic by design — they were trained to produce coherent and confident responses. A model that says 'I verified and the result is correct' is not equivalent to a system that independently verified.
Anthropic documents this explicitly: in agentic pipelines, errors accumulate. An error at step 3 of a 10-step process can propagate and amplify in subsequent steps. The verifier breaks that propagation.
In practice, the simplest verifier that works is a deterministic function: 'does the output have all required fields? Are the values within expected ranges? Is the JSON valid?' That is not glamorous, but it catches most production errors.
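The three questions above (required fields, value ranges, valid JSON) translate almost directly into code. The sketch below is one possible shape, with assumptions stated plainly: the field names, the amount range, and the currency list are invented for illustration and should be replaced by your own criteria.

```python
import json

# Illustrative criteria; not from the playbook.
REQUIRED_FIELDS = {"customer_id": str, "amount": (int, float), "currency": str}
AMOUNT_RANGE = (0.01, 10_000.00)
ALLOWED_CURRENCIES = {"USD", "EUR", "BRL"}


def verify_output(raw: str) -> list[str]:
    """Return a list of violations; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]

    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            errors.append(f"missing field: {name}")
        elif isinstance(data[name], bool) or not isinstance(data[name], expected):
            # bool is a subclass of int in Python, so reject it explicitly.
            errors.append(f"wrong type for field: {name}")
    if errors:
        return errors

    if not AMOUNT_RANGE[0] <= data["amount"] <= AMOUNT_RANGE[1]:
        errors.append("amount out of range")
    if data["currency"] not in ALLOWED_CURRENCIES:
        errors.append("unsupported currency")
    return errors
```

Returning the list of violations, rather than a bare boolean, gives the loop something specific to feed back to the model or to write into the audit log.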
For higher-risk cases — financial actions, external communications, data modifications — the verifier needs to be a human gate or a second model with an adversarial prompt ('find the errors in this response'). The cost of a second LLM call is trivial compared to the cost of an incorrect action in production.
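One way to encode this risk tiering is a small routing function that decides which kind of verification an action needs. This is a hypothetical sketch: the action kinds, the `HUMAN_THRESHOLD` value, and the rule that irreversible actions always go to a human are assumptions chosen for illustration, not rules taken from the playbook.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    AUTO_APPROVE = "auto_approve"
    SECOND_MODEL_REVIEW = "second_model_review"
    HUMAN_APPROVAL = "human_approval"


@dataclass(frozen=True)
class ProposedAction:
    kind: str
    amount: float = 0.0
    reversible: bool = True


# Illustrative values; set these from your own governance rules.
HIGH_RISK_KINDS = {"payment", "external_email", "data_modification"}
HUMAN_THRESHOLD = 1_000.0


def route(action: ProposedAction) -> Decision:
    """Pick the verification path for an action that already passed deterministic checks."""
    if action.kind not in HIGH_RISK_KINDS:
        return Decision.AUTO_APPROVE
    if not action.reversible or action.amount >= HUMAN_THRESHOLD:
        return Decision.HUMAN_APPROVAL
    return Decision.SECOND_MODEL_REVIEW
```

Keeping the routing deterministic means the decision about who verifies is never delegated to the model being verified.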
A heuristic I use: if you cannot write the verifier criteria before building the agent, you do not understand the problem well enough to automate it.
Maturity matrix by stage
| Stage | Autonomy | Predictability | Cost per run | Main risk | What to add to reach the next stage |
|---|---|---|---|---|---|
| 1 — Prompt | None | High | Fixed and predictable | Output quality | Tools + schemas |
| 2 — Prompt + Tools | Low (1 cycle) | High | Low + tool cost | Privilege escalation | Loop + stop rules |
| 3 — Loop (ReAct) | Medium (multi-step) | Medium | Variable (explosion risk) | Infinite loop, uncontrolled cost | External verifier |
| 4 — Verifier | Medium-high | High (with gates) | Medium + verifier cost | False positive in verifier | Observability + guardrails + audit |
| 5 — Pipeline | High (operable) | High (with alerts) | High (infra + observability) | Operational complexity, false sense of control | — (top of the ladder) |
The Ladder: from Prompt to Pipeline
Each stage adds components around the LLM. The model itself does not change — the system around it is what evolves. Read bottom to top: each layer presupposes the previous one.
- Trigger · EventBridge / SQS
- Orchestrator · Step Functions
- Guardrails · Bedrock Guardrails
- Audit Log · CloudTrail + S3 Object Lock
- Alerts · CloudWatch Alarms
- Verifier · Deterministic / LLM-judge / Human gate
- Approval Gate · Irreversible actions
- Loop Controller · max_iter / timeout / budget
- Context Manager · Memory between iterations
- Tool Executor · IAM least-privilege
- Tools · API / DB / Search / Code
- LLM · Bedrock / Claude / etc.
- Prompt · System + User + Examples
- User / System · Caller
What Bedrock AgentCore solves — and what it does not
Amazon Bedrock AgentCore (GA 2025) is AWS's bet to solve the infrastructure of Stages 2 through 5 in a managed way. It is worth understanding what it delivers and where you still need to do the work.
What AgentCore solves:
- Session and memory management: persistent context between iterations without you managing DynamoDB underneath
- Tool execution runtime: managed executor with retry, timeout, and automatic tool call logging
- Bedrock Guardrails integration: native input/output filtering, including PII detection and prohibited topics
- Basic observability: execution traces integrated with CloudWatch, with agentId, sessionId, and spans per tool call
- Access control: IAM integration to define which tools the agent can call
What AgentCore does NOT solve — and you need to build:
- External verifier (Stage 4): AgentCore does not have a native independent verification component. You implement this as a Lambda or Step Functions state before returning output
- Approval gates for irreversible actions: requires manual integration with SNS/SES for human notification or Step Functions with .waitForTaskToken
- Granular cost tracking per run: AgentCore logs tokens, but aggregating cost per business runId requires custom instrumentation
- Business stop rules: max_iterations is configurable, but stop logic based on business criteria (e.g., 'stop if the result is already good enough') is your responsibility
- Immutable audit log: CloudTrail captures API calls, but a business audit log with the agent's reasoning requires explicit writing to S3 with Object Lock
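As an illustration of the custom instrumentation mentioned for cost tracking, the sketch below aggregates token usage into a cost per business run. Everything specific here is an assumption: the record shape, the `run_id` values, and the prices in `PRICE_PER_1K` are placeholders, not AgentCore's log format or any provider's actual pricing.

```python
from collections import defaultdict
from decimal import Decimal

# Placeholder prices in USD per 1,000 tokens; substitute your model's real rates.
PRICE_PER_1K = {"input": Decimal("0.003"), "output": Decimal("0.015")}


def cost_per_run(records: list[dict]) -> dict[str, Decimal]:
    """Sum token cost by business run id, across however many sessions a run used."""
    totals: dict[str, Decimal] = defaultdict(Decimal)
    for record in records:
        cost = (
            Decimal(record["input_tokens"]) * PRICE_PER_1K["input"]
            + Decimal(record["output_tokens"]) * PRICE_PER_1K["output"]
        ) / 1000
        totals[record["run_id"]] += cost
    return dict(totals)


# Hypothetical usage records, one per model invocation.
records = [
    {"run_id": "invoice-42", "session_id": "s1", "input_tokens": 1000, "output_tokens": 200},
    {"run_id": "invoice-42", "session_id": "s2", "input_tokens": 2000, "output_tokens": 0},
    {"run_id": "invoice-43", "session_id": "s3", "input_tokens": 500, "output_tokens": 100},
]
costs = cost_per_run(records)
```

`Decimal` is used instead of floats so that per-run totals add up exactly when they are later reconciled against a bill.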
The practical conclusion: AgentCore significantly reduces the work to reach Stage 3. It does not replace the engineering of Stages 4 and 5. Use it as an accelerator, not a complete solution.
Anti-patterns that kill agents in production
1. Moving to Stage 3 without stop rules. The loop works in the demo because the demo has 3 iterations. In production, with real data and edge cases, the agent can iterate 50 times before you notice. Define max_iterations, max_tokens_total, and timeout before enabling the loop.
2. Giving the agent the developer's permissions. The tool executor runs with the credentials of whoever configured it. If you tested with your admin IAM role, the agent has admin access. Create a dedicated role with minimum scope before the first deploy.
3. Skipping Stage 4 and going straight to 5. Instrumentation does not replace verification. You can have perfect traces of an agent that is confidently producing incorrect outputs.
4. Using the same agent for read and write. Separate query agents (read-only) from action agents (write). This limits the blast radius of unexpected behavior.
5. Not having an incident runbook. When the agent gets stuck in production at 2am, you need to know how to stop the execution, how to revert actions, and who to notify. Document this before go-live.
6. Trusting the LLM's own 'I verified'. Self-verification in LLMs is notoriously unreliable. The model that generated the output should not be the only one evaluating it.
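The three limits named in the first anti-pattern (max_iterations, max_tokens_total, and timeout) can be enforced by a small loop controller. This is a minimal sketch under stated assumptions: `StopRules`, `StepResult`, and `run_loop` are hypothetical names, the default budget and timeout values are arbitrary, and `step` stands in for one observe, reason, act cycle of your agent.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class StopRules:
    max_iterations: int = 10        # starting point suggested in the playbook
    max_tokens_total: int = 50_000  # illustrative budget
    timeout_s: float = 60.0         # illustrative wall-clock limit


@dataclass
class StepResult:
    done: bool
    tokens_used: int
    output: str = ""


def run_loop(step: Callable[[int], StepResult], rules: StopRules) -> dict:
    """Run the agent step until it finishes or a stop rule fires, and say which one."""
    started = time.monotonic()
    tokens = 0
    for i in range(1, rules.max_iterations + 1):
        if time.monotonic() - started > rules.timeout_s:
            return {"status": "stopped", "reason": "timeout", "iterations": i - 1, "tokens": tokens}
        result = step(i)
        tokens += result.tokens_used
        if result.done:
            return {"status": "done", "reason": "finished", "iterations": i,
                    "tokens": tokens, "output": result.output}
        if tokens >= rules.max_tokens_total:
            return {"status": "stopped", "reason": "token_budget", "iterations": i, "tokens": tokens}
    return {"status": "stopped", "reason": "max_iterations",
            "iterations": rules.max_iterations, "tokens": tokens}
```

The timeout is checked between steps, so it does not interrupt a step that is already running; a hard per-call timeout on the model and tool clients is still needed for that.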
Rule of thumb
'If you cannot describe the stop criterion and the success criterion before building, you are not ready for the next stage.' Stop criterion = when the agent should stop (stop rules, budget, timeout). Success criterion = how you know the output is correct (verifier criteria). Without both written explicitly, you are building in the dark.
After working with financial-grade systems where an incorrect output has real consequences, I developed a clear stance: I treat agents as high-risk systems from Stage 2, not from the moment something goes wrong. In practice, this means:
- I never move to Stage 3 without stop rules documented and tested. This is not bureaucracy; it is the difference between a $0.50 run and a $500 run. I have seen loops that consumed monthly budgets in hours because someone assumed 'the model will stop when it is done'.
- I implement the verifier before enabling write actions. Even a simple verifier (JSON schema validation, a required-field checklist) needs to exist before any tool that modifies state. The cost of a second LLM call for verification is trivial; the cost of reverting an incorrect action in production is not.
- For financial systems or sensitive data, Stage 4 mandatorily includes a human-in-the-loop for actions above a threshold. No matter how good the agent is, there are decisions that require a human signature. This is not a technical limitation; it is governance.
My criterion for recommending Bedrock AgentCore: if you need to reach Stage 3 quickly and have a small team, AgentCore is worth the lock-in. It solves the plumbing of session, memory, and tool execution. But if you have regulatory audit requirements or need cross-cloud portability, build the runtime yourself on primitives (Lambda + DynamoDB + SQS); you will have more control over what goes into the audit log.
The insight that most changes the conversation with stakeholders: showing the stage ladder as a maturity map, not a feature list. When the CTO sees that 'reliable agent' is Stage 5 and that skipping stages has concrete costs, the conversation about timeline and resources becomes much more honest.
Well-Architected lenses by stage
Security
Stage 2 mandates least-privilege per tool. Stage 5 requires immutable audit log and input/output guardrails. Never share credentials between tools with different scopes.
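A lightweight pre-deploy check can catch the two security mistakes this playbook warns about: tools sharing credentials, and a read-only agent holding a write tool. The sketch is purely illustrative; `ToolGrant`, the role names, and the "query" versus "action" agent kinds are hypothetical and do not correspond to any AWS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolGrant:
    tool: str
    role: str     # hypothetical role identifier, intended to be one per tool
    writes: bool  # True if the tool can modify state


GRANTS = [
    ToolGrant("search_orders", "role/agent-search-orders-ro", writes=False),
    ToolGrant("refund_order", "role/agent-refund-order-rw", writes=True),
]


def check_grants(grants: list[ToolGrant], agent_kind: str) -> list[str]:
    """Return configuration problems; an empty list means the grants look sane."""
    problems = []
    first_tool_for_role: dict[str, str] = {}
    for grant in grants:
        if grant.role in first_tool_for_role:
            problems.append(
                f"{grant.tool} shares role {grant.role} with {first_tool_for_role[grant.role]}"
            )
        first_tool_for_role.setdefault(grant.role, grant.tool)
        if agent_kind == "query" and grant.writes:
            problems.append(f"query agent must not hold write tool {grant.tool}")
    return problems
```

A check like this can run in CI against the agent's tool configuration, so a shared or over-scoped role is caught before the first deploy rather than in an incident review.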
Reliability
Stop rules (Stage 3) and verifier (Stage 4) are the primary reliability mechanisms. Without them, the system has no way to detect or contain cascading failures.
Performance efficiency
Each stage adds latency. Measure P95 latency per stage. The LLM-judge verifier can add 2-5s — evaluate whether the use case tolerates this.
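Measuring P95 per stage needs nothing more than grouped latency samples. The sketch below uses the nearest-rank definition of a percentile, which is one common convention among several; the stage names and sample values are illustrative.

```python
from collections import defaultdict


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile: the smallest sample with at least 95% of samples at or below it."""
    if not samples:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def p95_by_stage(measurements: list[tuple[str, float]]) -> dict[str, float]:
    """Group (stage, seconds) pairs and compute P95 latency for each stage."""
    by_stage: dict[str, list[float]] = defaultdict(list)
    for stage, seconds in measurements:
        by_stage[stage].append(seconds)
    return {stage: p95(values) for stage, values in by_stage.items()}
```

For example, `p95_by_stage([("verifier", 2.0), ("verifier", 5.0), ("loop", 1.0)])` reports the verifier and the loop separately, which is what lets you see whether an LLM-judge verifier is the step pushing a run past its latency budget.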
Sustainability
Unnecessary loops consume energy and generate cost without value. Well-defined stop rules and success criteria reduce wasted iterations.
Conclusion
The 5-stage ladder is not a mandatory linear progression — it is a decision map. Most enterprise problems do not need to reach Stage 5. What every problem needs is for you to know which stage you are at and why. The most expensive mistake is not staying at Stage 1 when the problem demands Stage 3. The most expensive mistake is moving to Stage 3 without Stage 3's stop rules, or to Stage 5 without Stage 4's verifier. You accumulate autonomy without accumulating control — and then the system does not fail predictably; it fails surprisingly. The bottleneck is not the prompt. It never was. The bottleneck is the system around it — and that system you build stage by stage, with intention, or you do not build it at all.