# Design Doc: Multi-Agent Orchestration with Amazon Bedrock and Step Functions

This document proposes a multi-agent orchestration architecture using Amazon Bedrock Agents in a supervisor/worker topology, with Step Functions managing state, retries, and human-in-the-loop. The focus is on separating reasoning responsibilities (LLM) from orchestration responsibilities (state flow), applying security guardrails, and controlling operational cost deterministically.

- URL: https://fernando.moretes.com/studies/design-doc-orquestracao-multi-agente-bedrock

- Markdown: https://fernando.moretes.com/studies/design-doc-orquestracao-multi-agente-bedrock/study.md?lang=en

- Type: Design Doc / RFC

- Company: Agent orchestration (scenario)

- Domain: AI / Agents

- Date: 2026-03-03

- Tags: bedrock, step-functions, multi-agent, orchestration, guardrails, human-in-the-loop, event-driven, ai

- Reading time: 11 min

---

AI agents are useful individually. In composition, without a serious orchestration layer, they become an unsolved engineering problem: infinite loops, uncontrolled costs, no audit trail, and silent failures. This document defines how to build that layer so that it is auditable, reversible, and bounded by a guaranteed cost ceiling.

## The Problem: LLMs Are Not Orchestrators

The most common pattern I see in agent projects is letting the LLM itself decide which tool to call, in what order, and when to stop. That works in demos. In production, it's a bet.

The core problem is that an LLM has no persistent state, no idempotency guarantees, no visibility into accumulated cost, and no native rollback mechanism. When you chain multiple agents — a supervisor delegating to specialized workers (search, calculation, writing, validation) — each hop adds latency, tokens, and failure surface. A worker that hangs or returns garbage can cause the supervisor to enter a retry loop with no circuit breaker.

The scenario motivating this document is a complex request processing system in a financial-operational context: the user submits a request requiring internal data research, risk calculation, document draft generation, and human approval before final execution. No single agent handles this well. Composition of specialized agents is the right answer — but composition needs an external state engine, not more prompt engineering.

The thesis of this design is simple: **the LLM decides what, Step Functions decides when and how**. The Bedrock Agent is the reasoning executor within each node of the flow; Step Functions is the control graph that defines transitions, timeouts, retries, compensations, and human approval checkpoints. This separation is not merely architectural — it is the difference between a system you can operate and one you hope works.

## Goals and Non-Goals

- ✅ GOAL: Define a supervisor/worker topology with clear responsibilities between Bedrock Agents and Step Functions
- ✅ GOAL: Guarantee idempotency across all agent invocations and tool calls
- ✅ GOAL: Implement human-in-the-loop as a native control point in the flow, not as a workaround
- ✅ GOAL: Apply Bedrock Guardrails for content filtering, PII, and denied topics at all input/output boundaries
- ✅ GOAL: Establish a deterministic cost ceiling via token control, per-execution timeout, and iteration limits
- ✅ GOAL: Full observability: tracing every step, every tool call, every supervisor decision

## Scenario Fact Sheet

- **Scenario:** Complex request processing with multiple specialized agents
- **Domain:** AI / Agents — financial-operational context
- **Core services:** Amazon Bedrock Agents, AWS Step Functions (Express + Standard), AWS Lambda, Amazon DynamoDB, Amazon SNS
- **Orchestration model:** Supervisor agent → specialized workers (4 workers: research, calculation, drafting, validation)
- **Step Functions type:** Standard Workflows for the main flow (audit, human-in-the-loop); Express Workflows for short-duration sub-tasks
- **Guardrails:** Bedrock Guardrails applied at all agent input and output boundaries
- **Cost ceiling (estimate):** Enforced through `max_tokens` per invocation, a 120s timeout per step, a maximum of 3 iterations per worker, and an accumulated-cost alarm in CloudWatch
- **Target region:** us-east-1 (Bedrock model availability) with passive DR in us-west-2

## Proposed Design: Plane Separation

The architecture is organized into three distinct planes with non-overlapping responsibilities.

**Control Plane: Step Functions Standard Workflow**

The main flow is a Standard Workflow. This is a deliberate choice: we need complete execution history (retained for up to 90 days), exactly-once semantics for critical steps, and native `waitForTaskToken` support for human-in-the-loop. Standard Workflows are billed per state transition rather than per duration, which suits flows that may wait hours for human approval.
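
To make the human-in-the-loop checkpoint concrete, here is a minimal sketch of what the approval state could look like, written as a Python dict that serializes to Amazon States Language. The topic ARN, the state names (`EscalateTimeout`, `Compensate`, `FinalExecution`), the message fields, and the 8-hour timeout are all assumptions for illustration, not part of the original design.

```python
import json

# Hypothetical ARN; replace with the real approval topic.
APPROVAL_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:approval-requests"


def build_approval_state(timeout_seconds: int = 8 * 3600) -> dict:
    """Return an ASL Task state that pauses until an approver answers."""
    return {
        "Type": "Task",
        # The .waitForTaskToken suffix pauses the state until
        # SendTaskSuccess or SendTaskFailure is called with the token.
        "Resource": "arn:aws:states:::sns:publish.waitForTaskToken",
        "Parameters": {
            "TopicArn": APPROVAL_TOPIC_ARN,
            "Message": {
                # Context object fields: execution id and the callback token.
                "executionId.$": "$$.Execution.Id",
                "taskToken.$": "$$.Task.Token",
                # Assumed input field holding the draft to approve.
                "draft.$": "$.draft",
            },
        },
        # Without a timeout, an unanswered approval would wait indefinitely.
        "TimeoutSeconds": timeout_seconds,
        "Catch": [
            {"ErrorEquals": ["States.Timeout"], "Next": "EscalateTimeout"},
            {"ErrorEquals": ["States.ALL"], "Next": "Compensate"},
        ],
        "Next": "FinalExecution",
    }


if __name__ == "__main__":
    print(json.dumps(build_approval_state(), indent=2))
```

The explicit `TimeoutSeconds` is the important design choice here: a paused Standard Workflow costs nothing while it waits, but an approval that never arrives should still end in a defined state.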

Within the main flow, short-duration sub-tasks (such as invoking a calculation worker that returns in seconds) are delegated to nested Express Workflows — cheaper for high frequency and with higher throughput. The Standard → Express composition is a natively supported pattern and solves the cost problem without sacrificing audit at the main flow level.

**Reasoning Plane: Bedrock Agents**

Each agent (supervisor and workers) is a Bedrock Agent with a defined set of Action Groups (tools). The supervisor does not execute tools directly — it receives the request, decomposes it into sub-tasks, and emits delegation tokens that Step Functions interprets to route to the correct workers. This differs from letting the supervisor call workers as tools: Step Functions maintains control of which worker was invoked, with which parameters, and the returned result — all auditable.
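
One way to picture the delegation hand-off is a small validation step that runs between the supervisor and the routing logic. The sketch below assumes the supervisor emits a JSON plan of the form `{"delegations": [{"worker": ..., "task": ...}]}`; that schema, the worker identifiers, and the function names are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import List

# Worker names mirror the four workers in this design; the exact
# identifiers are an assumption.
ALLOWED_WORKERS = {"research", "calculation", "drafting", "validation"}


@dataclass(frozen=True)
class Delegation:
    worker: str
    task: str


def parse_delegations(supervisor_output: str) -> List[Delegation]:
    """Validate the supervisor's plan before the workflow routes anything."""
    plan = json.loads(supervisor_output)
    delegations = []
    for item in plan["delegations"]:
        worker = item["worker"]
        # Reject anything outside the fixed worker set instead of trusting
        # the model's output.
        if worker not in ALLOWED_WORKERS:
            raise ValueError(f"unknown worker: {worker!r}")
        delegations.append(Delegation(worker=worker, task=item["task"]))
    return delegations


if __name__ == "__main__":
    sample = '{"delegations": [{"worker": "research", "task": "find exposure"}]}'
    print(parse_delegations(sample))
```

Because the routing decision is made from validated data rather than by the supervisor calling workers directly, every delegation is visible in the execution history.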

Each worker has a restricted tool scope: the research worker accesses only the Knowledge Base via Bedrock; the calculation worker accesses only financial calculation Lambda functions; the drafting worker has no access to external data; the validation worker accesses only compliance rules in DynamoDB. This least-privilege principle at the agent level is enforced via distinct IAM roles per agent.

**Security Plane: Guardrails and IAM**

Bedrock Guardrails are configured as a cross-cutting layer applied to all invocations — both at the supervisor input (user input sanitization) and at each worker's output before being passed to the next step. Guardrails cover: content filtering by category (hate, violence, sexual, insults), PII detection and masking (CPF, CNPJ, banking data in the financial context), and denied topics list (instructions for operations outside the system scope).
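
As a rough illustration, the three guardrail policies described above could be expressed as a request payload like the one below. The field names follow my understanding of the Bedrock `CreateGuardrail` request shape and should be checked against the current API reference. The guardrail name, messages, topic definition, and regex patterns are assumptions. CPF and CNPJ are matched with custom regexes here on the assumption that they are not built-in PII entity types.

```python
import re

# Formatted CPF (000.000.000-00) and CNPJ (00.000.000/0000-00) patterns.
# Unformatted digit-only variants are not covered by this sketch.
CPF_PATTERN = r"\d{3}\.\d{3}\.\d{3}-\d{2}"
CNPJ_PATTERN = r"\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2}"

CONTENT_CATEGORIES = ("HATE", "VIOLENCE", "SEXUAL", "INSULTS")


def build_guardrail_config(name: str = "financial-ops-guardrail") -> dict:
    """Assemble a guardrail definition covering content, PII, and topics."""
    return {
        "name": name,
        "blockedInputMessaging": "This request cannot be processed.",
        "blockedOutputsMessaging": "The response was blocked by policy.",
        "contentPolicyConfig": {
            "filtersConfig": [
                {"type": c, "inputStrength": "HIGH", "outputStrength": "HIGH"}
                for c in CONTENT_CATEGORIES
            ]
        },
        "sensitiveInformationPolicyConfig": {
            "regexesConfig": [
                {"name": "cpf", "pattern": CPF_PATTERN, "action": "ANONYMIZE"},
                {"name": "cnpj", "pattern": CNPJ_PATTERN, "action": "ANONYMIZE"},
            ]
        },
        "topicPolicyConfig": {
            "topicsConfig": [
                {
                    "name": "out-of-scope-operations",
                    "definition": "Instructions for operations outside the "
                                  "request processing system.",
                    "type": "DENY",
                }
            ]
        },
    }


if __name__ == "__main__":
    # Local sanity check of the patterns only; no AWS call is made.
    print(bool(re.search(CPF_PATTERN, "CPF 123.456.789-09")))
```

If this shape matches the API, the dict could be passed as keyword arguments to `boto3.client("bedrock").create_guardrail(...)`.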

The decision to apply guardrails on worker outputs — not just the final output — is intentional: a compromised or hallucinating worker should not contaminate the context of subsequent workers. The additional guardrail cost per invocation is acceptable given the financial context.

## Multi-Agent Orchestration Topology

Complete flow from request submission to final execution, including human-in-the-loop, guardrails at all boundaries, and state control via Step Functions.

### 👤 Entry

- User / System API caller (user)
- API Gateway REST endpoint (edge)

### 🔀 Orchestration — Step Functions

- Standard Workflow Main orchestration (compute)
- Express Workflow Short sub-tasks (compute)
- waitForTaskToken Human approval (compute)

### 🤖 AI — Bedrock Agents

- Supervisor Agent Decompose & route (ai)
- Worker: Research Knowledge Base (ai)
- Worker: Calculation Financial logic (ai)
- Worker: Drafting Doc generation (ai)
- Worker: Validation Compliance check (ai)

### 🛡️ Security

- Bedrock Guardrails PII / content / topics (security)
- IAM Roles Per-agent least privilege (security)

### ⚙️ Tools & Data

- Lambda Calc functions (compute)
- Bedrock Knowledge Base (data)
- DynamoDB Compliance rules + execution state (storage)
- SNS Approval notification (messaging)

### 📊 Observability

- X-Ray Distributed tracing (compute)
- CloudWatch Metrics + cost alarm (compute)

### Flows

- user -> apigw: Submit request
- apigw -> sfn_main: StartExecution
- sfn_main -> guardrails: Input sanitization
- guardrails -> supervisor: Sanitized input
- supervisor -> sfn_main: Delegation tokens
- sfn_main -> sfn_express: Sub-task dispatch
- sfn_express -> worker_research: InvokeAgent
- sfn_express -> worker_calc: InvokeAgent
- sfn_express -> worker_draft: InvokeAgent
- sfn_express -> worker_valid: InvokeAgent
- worker_research -> kb: Retrieve
- worker_calc -> lambda_calc: Tool call
- worker_valid -> dynamo: Rules lookup
- worker_research -> guardrails: Output check
- worker_calc -> guardrails: Output check
- worker_draft -> guardrails: Output check
- sfn_main -> hitl: waitForTaskToken
- hitl -> sns_hitl: Notify approver
- sns_hitl -> sfn_main: SendTaskSuccess/Failure
- sfn_main -> dynamo: Persist execution state
- sfn_main -> xray: Traces
- sfn_main -> cw: Metrics + cost
- iam -> supervisor: Role binding
- iam -> worker_research: Role binding

## Idempotency, Retries, and Cost Ceiling

These three topics are inseparable in agent systems and deserve explicit treatment.

**Idempotency**

Every Bedrock Agent invocation via Step Functions receives a `sessionId` derived from the workflow's `executionId` plus the step identifier. This ensures that if Step Functions re-executes a step due to timeout or transient failure, the agent invocation is identifiable as a duplicate. On the Bedrock side, the `sessionId` allows the agent's memory session to be reused without reprocessing the full context — which also reduces token cost.
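
A minimal sketch of the derivation, assuming the execution identifier is an execution ARN (which can be long and contain characters worth normalizing) and that a hash is an acceptable way to keep the result short and deterministic. The `sfn-` prefix and the function name are hypothetical.

```python
import hashlib


def derive_session_id(execution_id: str, step_name: str) -> str:
    """Derive a stable session id from the execution and step identifiers.

    The same (execution, step) pair always yields the same id, so a retried
    step reuses the session instead of opening a new one.
    """
    raw = f"{execution_id}#{step_name}".encode("utf-8")
    digest = hashlib.sha256(raw).hexdigest()
    # 4-character prefix + 60 hex characters = 64 characters total.
    return f"sfn-{digest[:60]}"


if __name__ == "__main__":
    arn = "arn:aws:states:us-east-1:123456789012:execution:main:run-42"
    print(derive_session_id(arn, "InvokeResearchWorker"))
```

Hashing trades readability for safety. If operators need to map a session back to an execution by eye, storing the mapping in the DynamoDB state table is one option.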

For tool calls with side effects (database writes, external API calls), Lambda functions implement idempotency via a composite key `{executionId}#{stepName}#{toolCallIndex}` persisted in DynamoDB with a 24-hour TTL. If the Lambda is invoked twice with the same key, it returns the cached result without re-executing the operation. This pattern is especially critical for the financial calculation worker, where re-execution may produce different results due to market data variation.
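
The pattern can be sketched with an in-memory store standing in for the DynamoDB table. The class and function names are hypothetical, and a real implementation would use a conditional write so that two concurrent invocations cannot both execute the operation.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

TTL_SECONDS = 24 * 3600  # 24-hour TTL, as described above


class InMemoryIdempotencyStore:
    """Stand-in for the DynamoDB table; keeps results until the TTL expires."""

    def __init__(self) -> None:
        self._items: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str, now: float) -> Optional[Tuple[Any]]:
        item = self._items.get(key)
        if item is None or item[0] <= now:
            return None  # missing or expired
        return (item[1],)  # wrapped so a cached None result still counts as a hit

    def put(self, key: str, result: Any, now: float) -> None:
        self._items[key] = (now + TTL_SECONDS, result)


def idempotency_key(execution_id: str, step_name: str, tool_call_index: int) -> str:
    """Build the composite key {executionId}#{stepName}#{toolCallIndex}."""
    return f"{execution_id}#{step_name}#{tool_call_index}"


def run_once(store: InMemoryIdempotencyStore, key: str,
             operation: Callable[[], Any], now: Optional[float] = None) -> Any:
    """Run the operation only if no unexpired result exists for the key."""
    now = time.time() if now is None else now
    hit = store.get(key, now)
    if hit is not None:
        return hit[0]
    result = operation()
    store.put(key, result, now)
    return result
```

For the calculation worker, this means a retried step returns the risk figure computed on the first attempt, even if the underlying market data has moved in between.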

**Retries with Exponential Backoff**

Step Functions configures retry on each agent invocation state with: `MaxAttempts: 3`, `IntervalSeconds: 2`, `BackoffRate: 2.0`, `MaxDelaySeconds: 30`. Errors of type `Bedrock.ThrottlingException` and `Lambda.TooManyRequestsException` are retried; errors of type `Bedrock.ValidationException` (invalid input after guardrail) are treated as definitive failures and transition to a compensation state without retry.
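
The retry policy above can be written as ASL `Retry` and `Catch` blocks, shown here as Python dicts, together with a helper that computes the resulting wait schedule. The error names are taken from the text as written; the `Compensate` state name and the helper are assumptions.

```python
from typing import List

AGENT_RETRY = [
    {
        "ErrorEquals": [
            "Bedrock.ThrottlingException",
            "Lambda.TooManyRequestsException",
        ],
        "IntervalSeconds": 2,
        "MaxAttempts": 3,
        "BackoffRate": 2.0,
        "MaxDelaySeconds": 30,
    }
]

# Definitive failures skip retry and go straight to compensation.
AGENT_CATCH = [
    {"ErrorEquals": ["Bedrock.ValidationException"], "Next": "Compensate"}
]


def retry_delays(retrier: dict) -> List[float]:
    """Return the wait before each retry attempt, in seconds."""
    delays = []
    for attempt in range(retrier["MaxAttempts"]):
        # Exponential growth: interval * rate^attempt, capped at the maximum.
        delay = retrier["IntervalSeconds"] * retrier["BackoffRate"] ** attempt
        delays.append(min(delay, retrier.get("MaxDelaySeconds", delay)))
    return delays


if __name__ == "__main__":
    print(retry_delays(AGENT_RETRY[0]))  # [2.0, 4.0, 8.0]
```

With these values the retries add at most 14 seconds of waiting per state, and the 30-second cap never triggers. It would only matter if `MaxAttempts` were raised to 5 or more.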

The compensation state is a critical point: it not only records the failure but executes a cleanup sequence — notifies the user, persists partial state in DynamoDB for possible manual resumption, and emits a categorized failure metric for analysis.

**Cost Ceiling**

The cost of an agent system is dominated by LLM tokens. I control this in three layers:

1. **Token limit per agent:** `maxTokens` is configured per agent. The supervisor has a higher limit (4096 output tokens) because it needs more elaborate reasoning; workers have lower limits (1024-2048) because they have restricted scope.
2. **Iteration limit per worker:** maximum 3 reasoning cycles (ReAct loops) before Step Functions forces a transition to an escalation state.
3. **Cost alarm in CloudWatch:** based on a custom metric of tokens consumed per execution, with a configurable threshold per request type.
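
The first two layers can be sketched as a transition rule plus a worst-case calculation. The state names (`NextStep`, `Escalate`, `InvokeWorker`) are hypothetical, and the calculation assumes one supervisor invocation per execution and the upper worker limit of 2048 tokens. It bounds output tokens only; input tokens are not limited by `maxTokens`.

```python
MAX_ITERATIONS = 3  # iteration limit per worker, from the design


def next_state(completed_iterations: int, task_complete: bool,
               max_iterations: int = MAX_ITERATIONS) -> str:
    """Decide the transition after a worker reasoning cycle.

    In the workflow itself this would be a Choice state reading a counter
    kept in the state context.
    """
    if task_complete:
        return "NextStep"
    if completed_iterations >= max_iterations:
        return "Escalate"
    return "InvokeWorker"


def worst_case_output_tokens(supervisor_max: int = 4096, worker_max: int = 2048,
                             workers: int = 4,
                             max_iterations: int = MAX_ITERATIONS) -> int:
    """Upper bound on output tokens for one execution under the stated limits."""
    return supervisor_max + workers * max_iterations * worker_max


if __name__ == "__main__":
    print(next_state(3, task_complete=False))  # Escalate
    print(worst_case_output_tokens())          # 28672
```

The point of the second function is that the ceiling is arithmetic, not a hope: multiplying it by the model's output token price gives a per-execution bound that does not depend on model behavior.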

The token metric is emitted by an instrumentation Lambda that intercepts Bedrock responses before returning them to Step Functions — an interceptor pattern that adds less than 5ms of latency but provides real-time cost visibility.
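
A sketch of the payload such an instrumentation Lambda could emit. The namespace, metric name, and dimension names are assumptions, and so is the availability of input and output token counts on the intercepted response. The dimensions deliberately use request type and agent rather than execution id, to keep metric cardinality low.

```python
def build_token_metric(request_type: str, agent_name: str,
                       input_tokens: int, output_tokens: int) -> dict:
    """Build a CloudWatch PutMetricData payload for one agent invocation."""
    return {
        "Namespace": "AgentOrchestration",  # hypothetical namespace
        "MetricData": [
            {
                "MetricName": "TokensConsumed",
                "Dimensions": [
                    {"Name": "RequestType", "Value": request_type},
                    {"Name": "Agent", "Value": agent_name},
                ],
                "Value": float(input_tokens + output_tokens),
                "Unit": "Count",
            }
        ],
    }


if __name__ == "__main__":
    print(build_token_metric("risk-assessment", "worker_calc", 1200, 350))
```

The returned dict can be passed as keyword arguments to `boto3.client("cloudwatch").put_metric_data(...)`. Per-execution totals can then be derived by summing in the workflow state or in a metric math expression.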

## Evaluated Orchestration Alternatives

### Native Bedrock Agent with sub-agents (Supervisor mode)

**Pros**
- Simpler configuration — everything within the Bedrock ecosystem
- Native integration with Knowledge Bases and Action Groups
- Session memory managed automatically

**Cons**
- No intermediate state visibility — black box for auditing
- No native human-in-the-loop support with flow pause
- No per-step cost control — hard to establish deterministic ceiling
- Limited retry logic, not configurable per error type

**Verdict:** Adequate for prototypes. Insufficient for financial production.

### LangChain / LangGraph self-hosted on ECS

**Pros**
- Maximum flexibility in topology and orchestration logic
- Rich ecosystem of integrations and abstractions
- Full control over orchestration code

**Cons**
- Additional infrastructure to manage (ECS, scalability, patching)
- Execution state requires own solution (Redis, DynamoDB) — no managed durability
- No native integration with Bedrock Guardrails
- Larger dependency and version maintenance surface

**Verdict:** Valid if the team has Python expertise and prefers portability. Higher operational cost.

### Step Functions + Bedrock InvokeModel direct (without Bedrock Agents)

**Pros**
- Full control over prompt and reasoning cycle
- No Agents framework overhead
- Simpler to debug — each call is explicit

**Cons**
- Tool calling and ReAct loop need to be implemented manually
- Context management and session memory are manual
- No native Guardrails — would need pre/post-processing Lambda

**Verdict:** Good option for simple cases. For multi-agent topology, implementation cost outweighs the benefit.

### Step Functions + Bedrock Agents (proposed design)

**Pros**
- Clear separation between orchestration (Step Functions) and reasoning (Bedrock Agents)
- Native human-in-the-loop via waitForTaskToken without additional infrastructure
- Complete execution audit in Step Functions + native Bedrock Guardrails
- Retry, timeout, and compensation configurable per step without additional code

**Cons**
- Higher initial configuration complexity (ASL + Agent definitions)
- Standard Workflow cost per transition can be significant at high volume
- Additional latency per hop between Step Functions and Bedrock API

**Verdict:** Chosen design. Best balance between operational control, auditability, and maintenance cost.

## Decision: Standard vs Express Workflow for the Main Flow

**Status:** accepted

**Context**

The main flow includes a human-in-the-loop point that may wait hours for approval. Express Workflows have a maximum duration of 5 minutes. Standard Workflows support executions up to 1 year and have exactly-once semantics.

**Decision**

Use Standard Workflow for the main flow and Express Workflows for short-duration sub-tasks invoked as nested states.

**Consequences**
- Cost per state transition in Standard Workflow — acceptable given expected volume (hundreds of executions per day, not millions)
- Complete execution history available in console and via API for up to 90 days
- waitForTaskToken works natively without workarounds
- High-frequency sub-tasks in Express Workflow reduce total cost by ~60% (estimate based on AWS public pricing)

## Phased Rollout Plan

1. **Phase 1 — Foundation (Weeks 1-2)** — Provision base infrastructure via IaC (Terraform or CDK): IAM roles per agent, DynamoDB for state and idempotency, Bedrock Guardrails configuration, Step Functions Standard Workflow with minimal flow (supervisor → 1 worker → result). Validate that guardrails block synthetic test inputs (PII, denied topics). No real traffic.

2. **Phase 2 — Workers and Tool Calling (Weeks 3-4)** — Implement the 4 workers with their Action Groups and tools. Configure Express Workflows for sub-tasks. Implement idempotency pattern in tool calling Lambdas. Integration tests with synthetic data covering: retry on throttling, worker failure with compensation, iteration limit reached. Measure end-to-end latency and token cost per request type.

3. **Phase 3 — Human-in-the-Loop and Observability (Week 5)** — Integrate waitForTaskToken with SNS for approver notification. Implement callback Lambda for SendTaskSuccess/Failure. Configure X-Ray tracing on all components. Create CloudWatch dashboard with metrics for: executions by status, latency per phase, tokens consumed, estimated cost per execution. Configure cost and failure rate alarms.

4. **Phase 4 — Pilot with Real Traffic (Weeks 6-7)** — Enable for 10% of real traffic via feature flag. Monitor response quality metrics (sampled human evaluation), escalation rate to human-in-the-loop, actual vs estimated cost, and failure rate by type. Adjust token limits, timeouts, and guardrail thresholds based on real data. Document identified edge cases.

5. **Phase 5 — GA and Operations (Week 8+)** — Full rollout. Operational runbook documented covering: how to inspect stuck executions, how to force failure or success on waitForTaskToken manually, how to adjust guardrails without redeploy, and rollback procedure to previous workflow version. Monthly cost review based on accumulated metrics.
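
The callback Lambda from Phase 3 can be sketched as follows. The event shape (`taskToken`, `approved`, `approver`, `reason`) and the error name `ApprovalRejected` are assumptions. The client is passed in so the logic can be exercised without AWS; in a Lambda it would be `boto3.client("stepfunctions")`, whose `send_task_success` and `send_task_failure` methods take the keyword arguments used here.

```python
import json


def handle_approval(event: dict, sfn_client) -> str:
    """Resume the paused workflow based on the approver's decision."""
    token = event["taskToken"]
    if event.get("approved"):
        # The output must be a JSON string; it becomes the task state's result.
        sfn_client.send_task_success(
            taskToken=token,
            output=json.dumps({"approved": True,
                               "approver": event.get("approver")}),
        )
        return "success"
    # A rejection fails the task so the workflow's Catch block can route it.
    sfn_client.send_task_failure(
        taskToken=token,
        error="ApprovalRejected",
        cause=event.get("reason", "rejected by approver"),
    )
    return "failure"
```

The same two calls are what the Phase 5 runbook relies on when an operator needs to force success or failure on a stuck approval manually.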

> **Critical Risks and Mitigations:** **1. Infinite loop between supervisor and workers:** The Bedrock Agent in supervisor mode can, with certain ambiguous prompts, continue delegating to workers without converging. Mitigation: Step Functions enforces iteration limits via counter in state context; after N cycles, forced transition to human escalation state.
>
> **2. Prompt injection via user input:** A malicious user may attempt to inject instructions in the input to bypass guardrails or manipulate the supervisor. Mitigation: Guardrails applied before any LLM processing; schema validation at API Gateway before starting the workflow.
>
> **3. Uncontrolled token cost in production:** Complex requests may consume far more tokens than estimated in tests. Mitigation: CloudWatch alarm with cost threshold per execution; circuit breaker that suspends new executions if accumulated daily cost exceeds configured limit.
>
> **4. Silent worker failure with plausible but incorrect response:** A worker may return a response that passes guardrails but is factually wrong (hallucination). Mitigation: Validation worker as mandatory step before final execution; human-in-the-loop for requests above risk threshold.
>
> **5. Cold start latency in Bedrock Agents:** First invocations after inactivity periods may have higher latency. Mitigation: Keep sessions active via periodic warm-up invocations; configure generous timeout on the first workflow step.
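
The daily circuit breaker from risk 3 can be sketched as a small admission check that runs before a new execution starts. The class name and the in-memory counter are assumptions; in practice the accumulated cost would live in a shared store such as the DynamoDB state table.

```python
from datetime import date
from typing import Optional


class DailyCostBreaker:
    """Suspends new executions once the day's accumulated cost hits the limit."""

    def __init__(self, daily_limit_usd: float) -> None:
        self.daily_limit_usd = daily_limit_usd
        self._day: Optional[date] = None
        self._spent = 0.0

    def _roll(self, today: date) -> None:
        # Reset the accumulator when the calendar day changes.
        if today != self._day:
            self._day = today
            self._spent = 0.0

    def record_cost(self, cost_usd: float, today: date) -> None:
        self._roll(today)
        self._spent += cost_usd

    def allow_new_execution(self, today: date) -> bool:
        self._roll(today)
        return self._spent < self.daily_limit_usd


if __name__ == "__main__":
    breaker = DailyCostBreaker(daily_limit_usd=100.0)
    breaker.record_cost(99.0, date(2026, 3, 3))
    print(breaker.allow_new_execution(date(2026, 3, 3)))
```

In-flight executions are not interrupted by this check; it only stops new ones from starting, which keeps the breaker from leaving half-finished work behind.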

## Well-Architected Assessment

- **Security**: Minimum-scope IAM roles per agent; Bedrock Guardrails at all input/output boundaries; no credentials in prompts or environment variables; VPC endpoints for Bedrock and DynamoDB where applicable.
- **Reliability**: Retry with exponential backoff configured per error type; explicit compensation on definitive failures; state persisted in DynamoDB for manual recovery; Standard Workflow guarantees exactly-once on critical steps.
- **Performance**: Express Workflows for short-duration sub-tasks; max_tokens configured per agent to avoid unnecessarily long responses; sessionId reused to avoid context reprocessing.
- **Sustainability**: max_tokens and iteration limits reduce unnecessary LLM compute consumption; session reuse avoids reprocessing; no idle resources (serverless-first).

## Success Metrics and Targets

- **Successful completion rate:** >= 95% of executions completed without manual engineering intervention
- **End-to-end latency (P95):** < 45 seconds for flows without human-in-the-loop (estimate based on Bedrock API benchmarks)
- **Cost per execution:** Within configured ceiling per request type; deviation > 20% triggers alarm
- **Guardrails activation rate:** Monitored per category; spike > 2x baseline triggers investigation of possible attack
- **Tracing coverage:** 100% of executions with complete X-Ray trace from API Gateway to final result
- **MTTR (Mean Time to Recover):** < 30 minutes for operational failures with documented runbook
- **HITL escalation rate:** Baseline to be established in pilot; target of 20% reduction per quarter via prompt improvement

> **My Perspective: The Most Common Mistake in Agent Systems:** Most agent projects I've seen fail in production didn't fail due to model limitations; they failed due to the absence of a serious orchestration layer. The team builds an agent that works well in tests, puts it in production, and three weeks later is debugging why an execution got stuck in a loop for 40 minutes and cost 15 dollars in tokens.
>
> The instinct to let the LLM 'figure out the path' is understandable: it's what demos show. But in systems that need auditing, cost control, and human approval, you need an external control graph. Step Functions isn't bureaucracy; it's the difference between a system you can operate at 3am when something breaks and one you need to manually restart.
>
> What I'd do differently from the market standard: apply guardrails on **worker outputs**, not just on system input and output. This seems excessive until you have the first case of a research worker returning data from the wrong customer that contaminates the drafting worker's context. The additional guardrail cost is less than the cost of a data incident.
>
> Another thing I'd emphasize: the **validation worker is not optional**. In any context with real consequences (financial, health, legal), the agent pipeline output needs to go through a validation step before execution. Don't trust the supervisor to detect worker errors: it has no ground truth visibility, only what the workers returned.
>
> Finally: measure cost per execution from day one. You'll be surprised by the variance. Requests that seem similar can differ 5x in token cost depending on the complexity of reasoning required. Without this visibility, you can't optimize or predict the bill.

## Verdict

The supervisor/worker topology with Bedrock Agents and Step Functions is the correct design for agent systems with auditing, cost control, and human approval requirements. The separation between the reasoning plane (Bedrock) and the control plane (Step Functions) is not over-engineering — it is what makes the system operable.

The three principles underpinning this design are: (1) **the LLM decides what, Step Functions decides when and how**; (2) **guardrails at all boundaries, not just at the edges**; (3) **idempotency is not optional when there are side effects**.

The additional cost of Standard Workflow per transition and guardrails per invocation is real and must be measured. For volumes of hundreds of executions per day in a financial context, that cost is justified by what you gain in auditability and risk control. For volumes of millions of simple executions, the architecture would need to be revisited — possibly with Express Workflows throughout and selective guardrails.

The most underestimated risk in this type of system is silent worker hallucination that passes guardrails. The validation worker and human-in-the-loop for high-risk cases are the only effective mitigations — and both need to be in the design from the start, not added later as a fix.

## References

- [Amazon Bedrock Agents — AWS](https://aws.amazon.com/bedrock/agents/)
- [AWS Step Functions Documentation](https://docs.aws.amazon.com/step-functions/)
- [Bedrock Agents — Multi-agent collaboration](https://docs.aws.amazon.com/bedrock/latest/userguide/agents-multi-agent-collaboration.html)
- [Amazon Bedrock Guardrails](https://docs.aws.amazon.com/bedrock/latest/userguide/guardrails.html)
- [Step Functions — Standard vs Express Workflows](https://docs.aws.amazon.com/step-functions/latest/dg/concepts-standard-vs-express.html)
- [Step Functions — Wait for a Callback with the Task Token](https://docs.aws.amazon.com/step-functions/latest/dg/connect-to-resource.html#connect-wait-token)
- [Bedrock Agents — Action Groups](https://docs.aws.amazon.com/bedrock/latest/userguide/agents-action-group.html)
- [AWS Step Functions — Error Handling](https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html)

