# Playbook: Multi-agent — when 1 is enough and when you need to orchestrate

Every additional agent multiplies cost, latency, and error surface. This playbook provides concrete criteria for choosing between single agent, supervisor, pipeline, and debate topologies, with decision tables, visual diagrams, and the traps that destroy multi-agent systems in production.

- URL: https://fernando.moretes.com/studies/playbook-multi-agente-quando-1-basta-e-quando-orquestrar

- Markdown: https://fernando.moretes.com/studies/playbook-multi-agente-quando-1-basta-e-quando-orquestrar/study.md?lang=en

- Type: Playbook

- Domain: AI / Agents

- Date: 2025-12-10

- Tags: multi-agent, bedrock, genai, orchestration, llm, aws, agents, architecture

- Reading time: 9 min

---

The pressure to add agents is almost always architecture-by-anxiety, not by necessity. A well-configured single agent with the right tools solves 80% of cases — faster, cheaper, and with fewer failure points. This playbook cuts through the hype with clear criteria: you'll leave knowing exactly when to stop at 1 and when multi-agent orchestration is the right answer.

## What you'll be able to decide after reading this

- Diagnose whether your case genuinely requires multiple agents or if it's accidental complexity
- Choose between the four topologies (single, supervisor, pipeline, debate) with technical criteria
- Understand the real cost of each additional agent in tokens, latency, and error surface
- Identify the 'self-verifier' trap: why agents evaluating each other is not verification
- Implement the right topology on Amazon Bedrock with proper guardrails and isolation

## Reference context for this playbook

- **Domain:** Generative AI / LLM Agent Systems
- **Primary platform:** Amazon Bedrock Agents + AWS Step Functions
- **Models covered:** Claude 3.x (Anthropic via Bedrock), Titan models, any model with tool use support
- **Baseline cost reference (estimate):** Each agent hop adds one inference round-trip (roughly 2-8 s with Claude 3.5 Sonnet) and roughly doubles context token consumption
- **Primary sources:** AWS Bedrock Docs, Anthropic Engineering Blog, AWS Step Functions
- **Audience:** Engineers and architects building agentic systems in production

## The mental model that unlocks everything: an agent is a loop, not a service

An LLM agent is fundamentally a reasoning loop: it perceives current state, decides an action (call a tool, respond, or keep thinking), executes, observes the result, and iterates. This is the ReAct pattern in its purest form. The problem starts when architects treat agents like microservices — each with a responsibility, each talking to the next via messages. The analogy is seductive but false.
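
The loop described above can be sketched in a few lines. This is a minimal illustration, not a production agent: `run_agent`, `Step`, and `scripted_policy` are hypothetical names, and the scripted policy stands in for what would be one model call per iteration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    action: str    # a tool name, or "respond" to finish
    argument: str  # the tool input, or the final answer


def run_agent(
    decide: Callable[[str, List[Tuple[str, str]]], Step],
    tools: Dict[str, Callable[[str], str]],
    task: str,
    max_steps: int = 5,
) -> str:
    """Perceive -> decide -> act -> observe, until the policy responds."""
    history: List[Tuple[str, str]] = []  # (action, observation) pairs
    for _ in range(max_steps):
        step = decide(task, history)  # in a real agent: one LLM call
        if step.action == "respond":
            return step.argument
        observation = tools[step.action](step.argument)  # execute the tool
        history.append((step.action, observation))       # observe the result
    return "stopped: step budget exhausted"


# Hypothetical scripted policy standing in for the model.
def scripted_policy(task: str, history: List[Tuple[str, str]]) -> Step:
    if not history:
        return Step("search", task)
    return Step("respond", f"answer based on: {history[-1][1]}")


tools = {"search": lambda query: f"doc about {query}"}
result = run_agent(scripted_policy, tools, "refund policy")
print(result)
```

Note that everything specialized lives in `tools`; the loop itself never changes when a new capability is added.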

Microservices have deterministic interfaces. Agents have probabilistic interfaces. When you connect two microservices, you know exactly what will flow between them. When you connect two agents, you're passing natural language — and every hop is an opportunity for hallucination, context loss, or goal drift. Error propagation in multi-agent systems is not linear: it's multiplicative.
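
A rough way to see the multiplicative effect, under the simplifying assumption (mine, not a measured result) that each hop succeeds independently with the same probability: the chain succeeds only if every hop does. The 95% figure below is illustrative.

```python
def chain_success(per_hop_success: float, hops: int) -> float:
    """Probability that every hop in a serial chain succeeds.

    Assumes hops fail independently, which is a simplification.
    """
    return per_hop_success ** hops


for hops in (1, 2, 4, 6):
    print(f"{hops} hop(s): {chain_success(0.95, hops):.1%} end-to-end success")
```

At 95% per hop, four hops already land at roughly 81% end to end.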

Anthropic articulates this precisely in its 'Building effective agents' guide: prefer simple systems, and increase complexity only when simplicity demonstrably fails. This isn't conservatism; it's engineering. A single agent with 5 well-defined tools (vector search, code execution, API calls, document reading, structured writing) covers most knowledge workflows I see in production. Claude 3.5 Sonnet's context window already supports 200k tokens, so a lot fits in a single well-structured prompt.

So what's the real criterion for scaling to multiple agents? **Genuinely distinct roles that cannot coexist in the same context.** Not 'different tasks' — roles with fundamentally incompatible objectives, permissions, and reasoning styles. A strategic planner and a code executor have different incentives. A critic and a generator have adversarial roles by design. That's real. 'I need one agent for research and another for writing' usually isn't — that's one agent with two tools.

## The real cost of every agent you add

Before any decision table, you need to internalize the marginal cost math. Every additional agent in your topology doesn't add cost linearly — it multiplies across three dimensions:

**1. Tokens:** Every handoff between agents requires the receiving agent to get enough context to understand what's being asked. In practice, agent A's output becomes part of agent B's input, which becomes part of agent C's input. In supervisor topologies with 3 sub-agents, you can easily be paying 2-4x the tokens of an equivalent single-agent solution. On Bedrock, this translates directly to inference cost.

**2. Latency:** Every agent hop is a synchronous inference call (or an asynchronous one with polling). With Claude 3.5 Sonnet, a typical reasoning call takes 2-8 seconds depending on complexity. A 4-agent serial pipeline therefore spends roughly 8-32 seconds (4 × 2-8 s) on inference alone, not counting tool execution time. For interactive applications, this is unacceptable. For batch processing, it may be tolerable.

**3. Error surface:** Every agent is an independent failure point. It can hallucinate, misinterpret the previous agent's instruction, call the wrong tool, or enter a loop. In multi-agent systems, debugging is exponentially harder because you need to reconstruct each agent's state at the moment of failure. Bedrock Agents traces help, but don't eliminate the problem.
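
The latency arithmetic from point 2 is easy to reproduce for your own topology. A small sketch, using the 2-8 second per-call estimate from the text as the default; `serial_latency` and its `tool_time_s` parameter are illustrative names, not part of any SDK.

```python
from typing import Tuple


def serial_latency(
    hops: int,
    per_call_s: Tuple[float, float] = (2.0, 8.0),
    tool_time_s: float = 0.0,
) -> Tuple[float, float]:
    """Best-case and worst-case latency for agents called one after another."""
    low, high = per_call_s
    return hops * (low + tool_time_s), hops * (high + tool_time_s)


low, high = serial_latency(4)
print(f"4 serial hops: {low:.0f}-{high:.0f} s of inference")
```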

The practical rule I use: before adding an agent, ask 'can I solve this with an additional tool on the existing agent?' If yes, don't add the agent. Tools are deterministic, cheap, and easy to test. Agents are probabilistic, expensive, and hard to debug.

## Agent topologies: pros, cons, and verdict

### Single Agent

**Pros**
- Minimum latency — a single reasoning loop
- Predictable and controllable cost
- Simple debugging: one trace, one context
- No context loss between hops
- Tools cover specialization without agent overhead

**Cons**
- Context limited for very long tasks (even with 200k tokens)
- Doesn't natively parallelize independent work
- Conflicting roles in the same prompt can degrade quality

**Verdict:** Default for 80% of cases. Always start here.

### Supervisor + Sub-agents

**Pros**
- Clear delegation of responsibility by domain
- Sub-agents can have isolated permissions and tools
- Natively supported in Amazon Bedrock multi-agent collaboration
- Supervisor maintains strategic context while sub-agents execute

**Cons**
- 2-4x token cost vs equivalent single agent
- Accumulated latency: each sub-agent adds 1 inference round-trip
- Supervisor can make suboptimal delegation decisions
- Debugging requires correlating traces across multiple agents

**Verdict:** Use when domains have genuinely distinct permissions or when sub-tasks are long enough to justify separate contexts.

### Pipeline (Sequential)

**Pros**
- Each stage has a narrow, well-defined input and output contract
- Easy to test stage by stage
- Integrable with Step Functions for durable orchestration and retry
- Clear separation of concerns: extraction → analysis → formatting

**Cons**
- Latency is the sum of all stages — no parallelism
- An error in an intermediate stage propagates to every downstream stage and can waste the work already done
- Original problem context can get lost along the chain

**Verdict:** Ideal for well-defined data transformations (ETL with LLM, structured document generation). Don't use for open-ended reasoning.

### Debate (Adversarial Multi-agent)

**Pros**
- Genuinely adversarial critic improves quality of high-stakes outputs
- Detects biases and blind spots a single agent misses
- Useful for code generation with independent verification

**Cons**
- 3-6x token cost, the most expensive of all
- Agents from the same base model have correlated biases — debate is not real verification
- Can converge to false consensus instead of truth
- Prohibitive latency for interactive cases

**Verdict:** Use with surgical precision: only when the cost of a wrong output far exceeds the cost of debate. Real verification requires code, not another LLM.

## Cost, latency, and risk by topology

| Topology | Token cost (relative) | Typical latency | Error surface | When to use | When to avoid |
| --- | --- | --- | --- | --- | --- |
| Single Agent | 1x (baseline) | 2–8s per call | Low: 1 failure point | Task fits in one reasoning line; tools cover specialization | Don't rule it out by default; evaluate it first |
| Supervisor | 2–4x | 5–20s (+ sub-agents) | Medium: N+1 failure points | Domains with distinct permissions; long and independent sub-tasks | When tools already solve the problem; when latency is critical |
| Pipeline | 1.5–3x | Sum of stages (10–60s) | Medium: cascading errors possible | Well-defined sequential transformations; LLM ETL; batch | Open-ended reasoning; interactive cases; when stages are interdependent |
| Debate | 3–6x | 15–60s+ per round | High: correlated bias between agents | Very high-stakes outputs; critical code generation + verification | Factual verification (use code/tools); cost-sensitive cases; real-time |

## Checklist: how to decide your topology in 6 steps

1. **Step 1: Write the objective in one sentence** — If you can't describe what the system needs to do in one clear sentence, you don't understand the problem yet. Don't architect. Good examples: 'Given a support ticket, classify, retrieve relevant documentation, and draft a response.' Bad examples: 'An intelligent system that autonomously processes information.'

2. **Step 2: List required actions — are they tools or agents?** — For each action, ask: is it deterministic with a well-defined interface? → Tool (Lambda, API, SQL query). Does it require autonomous reasoning with its own state and objective? → Agent candidate. In most cases, 90% of 'actions' are tools. If everything is a tool, you have a single agent.

3. **Step 3: Check for real parallelism** — Are there sub-tasks that can execute simultaneously with independent results? If yes, parallel pipeline or supervisor with parallel sub-agents may justify the overhead. If tasks are sequential or interdependent, the parallelism is illusory and the overhead doesn't justify itself.

4. **Step 4: Check for mandatory security isolation** — Do different parts of the system need distinct IAM roles, segregated data access, or compliance requiring context separation? This is one of the few solid technical reasons for multiple agents. In Bedrock, each agent has its own execution role — use this for real isolation, not cosmetic.

5. **Step 5: Estimate the cost of the candidate topology** — Calculate: average tokens per call × topology multiplier × call volume/day × price per token for the chosen model. If multi-agent cost is >2x single agent for the same result, you need explicit technical justification — not 'it's more organized'.

6. **Step 6: Define your verifier — and ensure it's not another LLM** — Every agentic system in production needs verification. But real verification is deterministic: unit tests on generated code, schema validation on produced JSON, sandbox execution, business rule checks in code. A second LLM 'reviewing' the first's output has correlated bias — that's not verification, it's quality theater.
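
The Step 5 estimate can be written down directly. A minimal sketch: `daily_cost` is a hypothetical helper, the price is a placeholder rather than a real Bedrock quote, and using one blended price for input and output tokens is a simplification.

```python
def daily_cost(
    avg_tokens_per_call: int,
    topology_multiplier: float,
    calls_per_day: int,
    price_per_1k_tokens: float,
) -> float:
    """Average tokens x topology multiplier x daily volume x token price."""
    total_tokens = avg_tokens_per_call * topology_multiplier * calls_per_day
    return total_tokens * price_per_1k_tokens / 1000


PLACEHOLDER_PRICE = 0.003  # hypothetical blended price per 1k tokens

single = daily_cost(4_000, 1.0, 10_000, PLACEHOLDER_PRICE)
supervisor = daily_cost(4_000, 3.0, 10_000, PLACEHOLDER_PRICE)

ratio = supervisor / single
needs_justification = ratio > 2  # the >2x threshold from Step 5
print(f"single: ${single:,.2f}/day, supervisor: ${supervisor:,.2f}/day, ratio: {ratio:.1f}x")
```

Swap in the multiplier range for the candidate topology and the current price of your chosen model to get a range rather than a point estimate.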

## How Amazon Bedrock implements multi-agent collaboration — and where the limits matter

Amazon Bedrock has supported multi-agent collaboration natively since 2024, with a supervisor model that can invoke sub-agents as if they were tools. The architecture is elegant: the supervisor agent receives the task, decides which sub-agent to invoke, passes relevant context, receives the result, and continues its reasoning. Each sub-agent has its own IAM execution role, its own knowledge bases, and its own action groups.

This cleanly solves the permissions isolation problem. A sub-agent accessing sensitive financial data can have a restricted role, while the supervisor operates with broader orchestration permissions. This is a legitimate use case for multi-agent — it's not about 'organization', it's about security boundaries.

What Bedrock doesn't automatically solve is the context problem. When the supervisor invokes a sub-agent, it needs to pass enough context for the sub-agent to understand what's being asked. This means relevant context is serialized, sent, processed, and the result returned. For long tasks, this can result in very large input contexts in sub-agents, increasing cost and latency.

For more complex orchestration — especially when you need retry, compensation, or long-running flows — AWS Step Functions is the natural complement. You can orchestrate Bedrock Agents calls via Step Functions, with full control over timeout, retry with exponential backoff, and durable state between steps. This is especially valuable in document processing pipelines where a step can fail and you need to resume from where you stopped, not restart from scratch.
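
As an illustration of the retry and timeout control described above, here is a sketch of one pipeline stage in Amazon States Language, built as a Python dict. It assumes the agent call is wrapped in a Lambda function; the ARN, state names, and numeric values are hypothetical.

```python
import json

# Hypothetical ARN of a Lambda function that wraps the agent call.
EXTRACT_FN_ARN = "arn:aws:lambda:us-east-1:123456789012:function:extract-stage"

state_machine = {
    "Comment": "Sketch: one pipeline stage with timeout, retry, and a failure branch",
    "StartAt": "Extract",
    "States": {
        "Extract": {
            "Type": "Task",
            "Resource": EXTRACT_FN_ARN,
            "TimeoutSeconds": 60,
            "Retry": [
                {
                    "ErrorEquals": ["States.Timeout", "States.TaskFailed"],
                    "IntervalSeconds": 2,  # wait before the first retry
                    "MaxAttempts": 3,      # up to three retries
                    "BackoffRate": 2.0,    # waits grow to 2 s, 4 s, 8 s
                }
            ],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "StageFailed"}],
            "End": True,
        },
        "StageFailed": {
            "Type": "Fail",
            "Error": "ExtractFailed",
            "Cause": "Retries exhausted",
        },
    },
}

definition = json.dumps(state_machine, indent=2)
print(definition)
```

In a multi-stage pipeline, `"End": True` would become `"Next": "<following stage>"`, and each stage would carry its own retry policy.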

An important note about Bedrock multi-agent limits: agent nesting depth has limits (check current documentation for region-specific limits). Very deep topologies — supervisor invoking sub-agent invoking another sub-agent — quickly become undebuggable and violate the simplicity principle. If you're thinking about 3+ levels of nesting, stop and redesign.

## Multi-agent topologies: visual decision map

The four possible topologies, their data flows, and decision points. The correct topology emerges from the problem, not from architectural preference.

### 👤 Input

- User (user)

### 🟢 Topology A: Single Agent

- Single Agent (ReAct loop) (ai)
- Tools (Lambda / API / KB) (compute)

### 🔵 Topology B: Supervisor

- Supervisor Agent (Bedrock) (ai)
- Sub-agent A (domain/role A) (ai)
- Sub-agent B (domain/role B) (ai)
- IAM Role A (isolated permissions) (security)
- IAM Role B (isolated permissions) (security)

### 🟡 Topology C: Sequential Pipeline

- Step Functions (durable orchestration) (messaging)
- Stage 1: Extract (ai)
- Stage 2: Analyze (ai)
- Stage 3: Format (ai)
- Verifier (code / schema, not an LLM) (compute)

### 🔴 Topology D: Debate / Adversarial

- Generator (proposes a solution) (ai)
- Critic (adversarial) (ai)
- Judge (or code) (ai)
- Code-based verification (mandatory) (compute)

### 💾 Output

- Final Response (data)

### Flows

- user -> single_agent: task
- single_agent -> tools_single: tool call
- tools_single -> single_agent: result
- single_agent -> output
- user -> supervisor: complex task
- supervisor -> sub_a: delegates
- supervisor -> sub_b: delegates
- sub_a -> iam_a
- sub_b -> iam_b
- sub_a -> supervisor: result
- sub_b -> supervisor: result
- supervisor -> output
- user -> sfn: starts pipeline
- sfn -> stage1
- stage1 -> stage2
- stage2 -> stage3
- stage3 -> verifier: validates (code)
- verifier -> output
- user -> generator: high stakes
- generator -> critic: proposal
- critic -> judge: critique
- judge -> code_verify: real verification
- code_verify -> output

> **Anti-patterns that destroy multi-agent systems in production**
>
> **1. Premature multi-agent for flexibility ('let's separate it to keep things organized'):** Architectural organization is not justification for multiple agents. You don't split microservices for aesthetics; you split for domain boundaries, independent scaling, or failure isolation. The same criterion applies to agents. If the only argument is 'it's cleaner', you're paying 2-4x more for a problem a good system prompt solves.
>
> **2. Self-evaluation as verifier ('agent B will review agent A's output'):** This is the most dangerous and most common anti-pattern. Two agents based on the same base model have correlated biases: they tend to make the same types of errors and approve the same types of incorrect responses. This is not verification: it's bias talking to bias. Real verification is deterministic: execute the generated code, validate JSON against a schema, check business rules in Python code. If you can't verify deterministically, be honest about the output's confidence level.
>
> **3. Loops without a clear exit condition:** Agents in a loop without explicit and testable stopping criteria will iterate until timeout or until consuming their token budget. Always define: maximum number of iterations, verifiable success criterion, and what happens when neither is reached.
>
> **4. Ignoring Bedrock traces in development:** The Bedrock Agents reasoning trace is your only window into what's happening inside the loop. Developing without inspecting the trace is like debugging code without logs. Always enable in development and have a plan for sampling in production.
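
For the trace point above, a sketch of collecting trace events alongside the answer. It assumes the `invoke_agent` operation of the boto3 `bedrock-agent-runtime` client with `enableTrace=True`; `ask_agent` is a hypothetical helper, and `FakeClient` is a test double with a simplified trace payload so the example runs without AWS credentials.

```python
from typing import Any, Dict, List, Tuple


def ask_agent(
    client: Any, agent_id: str, alias_id: str, session_id: str, text: str
) -> Tuple[str, List[Dict[str, Any]]]:
    """Invoke an agent and return (answer, trace events)."""
    response = client.invoke_agent(
        agentId=agent_id,
        agentAliasId=alias_id,
        sessionId=session_id,
        inputText=text,
        enableTrace=True,  # ask the service to emit reasoning trace events
    )
    answer_parts: List[str] = []
    traces: List[Dict[str, Any]] = []
    for event in response["completion"]:  # streamed events
        if "chunk" in event:
            answer_parts.append(event["chunk"]["bytes"].decode("utf-8"))
        elif "trace" in event:
            traces.append(event["trace"])
    return "".join(answer_parts), traces


class FakeClient:
    """Test double; in real use: boto3.client("bedrock-agent-runtime")."""

    def invoke_agent(self, **kwargs: Any) -> Dict[str, Any]:
        return {
            "completion": [
                {"trace": {"note": "simplified placeholder payload"}},
                {"chunk": {"bytes": b"Refunds take "}},
                {"chunk": {"bytes": b"5 days."}},
            ]
        }


answer, traces = ask_agent(FakeClient(), "AGENT_ID", "ALIAS_ID", "session-1", "How long do refunds take?")
print(answer, len(traces))
```

In production, the same function can sample: pass the trace list to your logger for only a fraction of sessions.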

> **Rule of thumb: complexity is not sophistication**
>
> **If you can't explain why you need N agents instead of N-1 tools, you don't need N agents.**

More specifically: add an agent only when one of these three conditions is true:
1. **Real security isolation** — the sub-agent needs permissions the main agent cannot have.
2. **Genuine parallelism** — sub-tasks are independent and the time gain justifies the cost overhead.
3. **Adversarial roles by design** — generator and critic need explicitly conflicting objectives.

Everything else is a tool.

> **My perspective: what I actually do in practice**
>
> When I start an agentic project, my default position is a single agent with as many tools as the use case requires. I only move to multi-agent when I hit one of the three criteria above in a demonstrable way, not a theoretical one.

The pattern I see most often in projects that come to me with problems is what I call 'pride architecture': someone built a system with 5 agents because it looks more impressive on a diagram. The system has 3x the cost, 4x the latency, and nobody can debug it when something goes wrong. Refactoring to a single agent with tools usually solves the problem in an afternoon.

For verification, I never trust LLM-as-judge for anything that matters. If I'm generating code, I execute it in a Lambda sandbox and verify the result. If I'm generating structured JSON, I validate it against a Pydantic schema before any downstream step consumes it. If I'm generating financial analysis, I have sanity rules in code that check expected ranges. This isn't distrust of the model; it's basic systems engineering.
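
A deterministic verifier of this kind can be very small. The sketch below uses only the standard library as a stand-in for the Pydantic schema mentioned above; the field names and the sanity range are hypothetical.

```python
import json
from typing import List

# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {
    "ticker": str,
    "revenue_growth_pct": (int, float),
    "summary": str,
}
GROWTH_RANGE = (-100.0, 500.0)  # hypothetical sanity bounds


def verify_output(raw: str) -> List[str]:
    """Return a list of problems; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]

    problems: List[str] = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif not isinstance(data[name], expected):
            problems.append(f"wrong type for {name}")

    # Business rule in code: growth must fall inside the expected range.
    growth = data.get("revenue_growth_pct")
    if isinstance(growth, (int, float)) and not GROWTH_RANGE[0] <= growth <= GROWTH_RANGE[1]:
        problems.append("revenue_growth_pct outside expected range")
    return problems


good = '{"ticker": "ABC", "revenue_growth_pct": 12.5, "summary": "steady"}'
bad = '{"ticker": "ABC", "revenue_growth_pct": 9000, "summary": "steady"}'
print(verify_output(good), verify_output(bad))
```

The same input always produces the same verdict, which is exactly what a second LLM cannot guarantee.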

Amazon Bedrock multi-agent collaboration is a well-implemented feature for cases where multi-agent is genuinely necessary. IAM role isolation is genuinely useful for compliance. But the feature existing doesn't mean you should use it. Most systems I build on Bedrock use a single agent with knowledge bases and action groups — and it works very well.

## Verdict

Multi-agent is an engineering solution for specific engineering problems — not a design philosophy. Every agent you add is a bet: you're betting that the gain in isolation, parallelism, or adversarial specialization outweighs the cost in tokens, latency, and operational complexity. In most cases, that bet loses.

The criterion isn't 'is the system complex?' — it's 'does the system's complexity require roles that cannot coexist in a single context?' If the answer is no, you have an agent with tools. If the answer is yes, you have a legitimate case for multi-agent — and now you know which topology to choose and why.

Real sophistication in agentic systems isn't in the number of agents. It's in the clarity of the objective, the quality of the tools, the robustness of verification, and the honesty about what the system can and cannot do with confidence.

## References

- [AWS — Multi-agent collaboration on Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/agents-multi-agent-collaboration.html)
- [Anthropic — Building effective agents](https://www.anthropic.com/engineering/building-effective-agents)
- [AWS Step Functions](https://aws.amazon.com/step-functions/)

