# Inside AI Agents (2/3): Architecture Pattern Catalog — from ReAct to Multi-Agent

The second lesson in the series maps the full catalog of AI agent architecture patterns: from single-agent loops (ReAct, Reflexion, Plan-and-Execute) to multi-agent orchestration, covering memory as an architecture decision, guardrails, and human-in-the-loop. The goal is to give the architect a precise vocabulary to choose — and justify — the right pattern for each problem, without falling into classic anti-patterns.

- URL: https://fernando.moretes.com/studies/agentes-de-ia-por-dentro-2-padroes-de-arquitetura

- Markdown: https://fernando.moretes.com/studies/agentes-de-ia-por-dentro-2-padroes-de-arquitetura/study.md?lang=en

- Type: Guide / Deep Dive

- Domain: IA / Agentes

- Date: 2026-06-26

- Tags: ai-agents, ReAct, multi-agent, memory, guardrails, orchestration, LLM, architecture-patterns

- Reading time: 8 min

---

In Part 1 of this series you understood what an AI agent is: a loop that perceives, reasons, acts, and observes results — different from a simple prompt pipeline. Now comes the question that really matters for system designers: *which pattern to use?* ReAct? Plan-and-Execute? Multi-agent with a supervisor? The wrong answer costs money (tokens are expensive), latency (chained loops add up in seconds), and reliability (every hop is a failure point). This guide is the catalog I wish I had when I started designing agents for production: each pattern explained from scratch, with when to use it, when to avoid it, and the anti-patterns that show up every time.

## What you will learn

- The 5 single-agent patterns (ReAct, Reflexion, Plan-and-Execute, Tool Use, Agentic RAG) and when each one fits
- Memory as an architecture decision: context window, session, long-term — and why poorly designed memory becomes cost and hallucination
- The 4 multi-agent topologies (supervisor, hierarchical, swarm, specialist routing) and the real criterion for choosing
- Guardrails and security as an architecture layer — not an afterthought
- Human-in-the-loop: when to require human approval and how to model checkpoints without blocking the flow
- Classic anti-patterns: premature multi-agent, tool sprawl, memory without TTL, infinite loop

## Quick Glossary — Terms that appear in this lesson

- **LLM:** Large Language Model — the language model that does the agent's core reasoning (e.g. Claude, GPT-4, Titan).
- **Tool / Function:** External function the agent can call: REST API, SQL query, vector search, calculator, etc.
- **Context Window:** Token limit the LLM can 'see' at once. Everything outside the window is invisible to the model — like process RAM.
- **RAG:** Retrieval-Augmented Generation — fetching relevant documents and injecting them into the prompt before generating a response.
- **Guardrail:** Validation layer that filters agent inputs/outputs — analogous to a WAF for LLMs.
- **Prompt Injection:** Attack where external data (e.g. email content) contains malicious instructions that hijack the agent's behavior.
- **Orchestrator / Supervisor:** Agent or process that breaks down tasks and delegates to specialized sub-agents.
- **TTL (Time-to-Live):** Expiry time for data in cache/memory. Without TTL, agent memory grows indefinitely.

## The Fundamental Loop: why every pattern starts here

If you read Part 1, you know that an agent is essentially a loop: the LLM receives an observation, reasons, decides on an action, executes it, observes the result, and repeats. What differentiates architectural patterns is not the loop itself — it is *how* the reasoning is structured inside it and *how many* agents participate.

Think like a software developer. You have written a `while` with a stop condition. An agent is exactly that, but the stop condition is decided by the LLM ('is the task complete?'), and the loop body can call external tools. The engineering problem is: how do you structure the reasoning inside the loop so that it is reliable, auditable, and does not enter an infinite loop?

**ReAct** (Reasoning + Acting, Yao et al., 2022) was the first pattern to formalize this: the model explicitly alternates between a *Thought* step (natural language reasoning), an *Action* step (tool call), and an *Observation* step (tool result). This explicit alternation is powerful because it makes reasoning auditable — you can read the log and understand *why* the agent made each decision. It is the equivalent of writing code with inline comments instead of a monolithic block.

## Diagram 1 — Single-Agent ReAct with Tools

A single LLM agent in a ReAct loop (Thought → Action → Observation). Tools are synchronous calls; the guardrail inspects input and output. The loop ends when the LLM emits 'Final Answer' or the step budget is exhausted.

### 👤 User / Client

- User request (user)

### 🛡️ Security Layer

- Input Guardrail prompt injection / PII filter (security)
- Output Guardrail toxicity / data leak filter (security)

### 🤖 Agent Core (ReAct Loop)

- LLM Claude / GPT-4 / Titan (ai)
- Thought reasoning step (ai)
- Action tool selection + args (ai)
- Observation tool result injected (ai)

### 🔧 Tools

- Vector Search Agentic RAG (data)
- External API REST / GraphQL (external)
- Calculator deterministic fn (compute)

### 💾 Memory

- Context Window in-prompt (ephemeral) (storage)
- Session Memory short-term store (storage)

### Flows

- user -> guardrail_in: raw input
- guardrail_in -> llm: validated input
- llm -> thought: 1. Thought
- thought -> action: 2. Action
- action -> tool_search: tool call
- action -> tool_api: tool call
- action -> tool_calc: tool call
- tool_search -> observation: result
- tool_api -> observation: result
- tool_calc -> observation: result
- observation -> llm: 3. loop back
- llm -> guardrail_out: Final Answer
- guardrail_out -> user: filtered response
- llm -> ctx_window: read/write
- llm -> session_mem: persist

## Beyond ReAct: Reflexion, Plan-and-Execute, and Agentic RAG

**Reflexion** (Shinn et al., 2023) adds a second inner loop of self-critique: after each attempt, the agent generates a reflection on what went wrong and stores that reflection in session memory before trying again. It is like a developer who writes a test, sees it fail, notes the diagnosis, and then fixes it — instead of simply trying again in the dark. The cost is clear: more tokens per attempt. Use it when the task has a verifiable correctness criterion (e.g. code that must compile, SQL that must return a non-empty result).

**Plan-and-Execute** separates reasoning into two distinct phases: a *Planner* LLM generates a step-by-step plan (think of a `Makefile`), and an *Executor* LLM executes each step independently. The advantage is that the plan can be inspected and approved by a human before execution — excellent for high-risk flows. The disadvantage is rigidity: if the environment changes during execution, the plan may become stale.

**Agentic RAG** is the upgrade from static RAG that every architect should know. In traditional RAG, you always fetch documents before generating. In Agentic RAG, the agent *decides* when to fetch, *formulates* the search query as part of its reasoning, and can perform multiple iterative searches ('I did not find it, I will reformulate the query'). This reduces noise in the context — you do not inject irrelevant documents — but requires the agent to know when it *does not know* something, which is a model calibration problem.

## Single-Agent Patterns — When to use each one
| Criterion | Pattern | Code analogy | Best for | Main cost | Avoid when |
| --- | --- | --- | --- | --- | --- |
| ReAct | while loop with log | Exploratory tasks with tools | Tokens per iteration | Task has a direct answer without tools | — |
| Reflexion | TDD with self-critique | Code generation, SQL, verifiable logic | 2-3x more tokens per attempt | No clear correctness criterion | — |
| Plan-and-Execute | Makefile + executor | Long flows with human approval | Planning latency + rigidity | Environment changes during execution | — |
| Tool Use / Function Calling | SDK with typed methods | Integration with external APIs | Tool I/O latency | More than 15-20 tools (tool sprawl) | — |
| Agentic RAG | Lazy loading of knowledge | Large, heterogeneous knowledge bases | Model calibration to know when to search | Small, stable knowledge base (static RAG suffices) | — |

## Memory as Architecture: the problem nobody models correctly

Memory in agents is analogous to storage in microservices: you have RAM (fast, expensive, volatile), local disk (cheaper, session-persistent), and distributed database (slow, cheap, shared). Each layer has a different trade-off.

The **context window** is the agent's RAM: everything in the active prompt. It is the fastest memory, but has linear token cost — every extra token you inject into the context increases inference cost and latency. The classic mistake is to dump everything into the context window ('I will include the full conversation history') and discover that the API bill tripled.

**Session memory** (short-term) is the equivalent of a Redis cache per session: you persist the current conversation history outside the prompt and inject only a summary or the last N turns. This solves the cost problem, but requires a summarization strategy — and LLM-based summarization has a cost too.

**Long-term memory** — vector, episodic, semantic — is the distributed database. You store facts, user preferences, past episodes in a vector store (e.g. OpenSearch, pgvector) and retrieve by semantic similarity when relevant. The risk here is twofold: memory without TTL grows indefinitely (storage cost and search latency), and stale memory generates hallucinations — the agent 'remembers' a fact that has changed.

## Diagram 2 — Agent Memory Layers

Three memory layers with increasing cost and latency. The agent chooses which layer to consult/write at each step of the ReAct loop. The 'summarization' arrow shows that the context window can be compressed into session memory.

### 🤖 Agent LLM

- Agent LLM ReAct loop (ai)

### ⚡ Layer 1 — Context Window (RAM)

- Context Window ~200k tokens max ephemeral, costly (storage)

### 🗂️ Layer 2 — Session Memory (Cache)

- Session Store Redis / DynamoDB TTL = session lifetime (storage)
- Summarizer LLM compression (ai)

### 🗄️ Layer 3 — Long-Term Memory (DB)

- Vector Store OpenSearch / pgvector semantic retrieval (data)
- Episodic Store structured facts user preferences (data)
- TTL Policy stale-data eviction (security)

### Flows

- agent_core -> ctx: read/write (sync)
- ctx -> summarizer: compress when > threshold
- summarizer -> session: persist summary
- agent_core -> session: retrieve history
- agent_core -> vector_store: semantic search
- agent_core -> episodic: read facts / preferences
- ttl_policy -> vector_store: evict stale
- ttl_policy -> episodic: evict stale

## Multi-Agent: when it truly helps and when it only adds complexity

The most important question about multi-agent is not 'which topology to use' — it is 'do I really need multi-agent?' The honest answer is: in most cases, no. A well-designed single agent with the right tools solves 80% of the problems people try to solve with complex multi-agent architectures.

Multi-agent makes sense in three real scenarios: (1) **genuine parallelism** — independent sub-tasks that can run simultaneously (e.g. analyzing 10 documents in parallel); (2) **domain specialization** — a sub-agent with a system prompt and tools optimized for a specific domain (e.g. a compliance agent vs. a financial analysis agent); (3) **context isolation** — you do not want the context of one sub-task to contaminate the reasoning of another.

The four main topologies are: **Supervisor/Orchestrator** (a central agent delegates to sub-agents and consolidates results — like a tech lead distributing tasks); **Hierarchical** (supervisor of supervisors — for very large problems); **Swarm/Peers** (agents without hierarchy that communicate via messages — more resilient, harder to debug); and **Specialist Routing** (a router LLM classifies the task and sends it to the correct specialist agent — analogous to an API Gateway with semantic routing).

The decision criterion I use in production: if you cannot explain in one sentence *why* each agent exists separately, you probably do not need multi-agent.

## Diagram 3 — Supervisor → Specialist Sub-Agents Topology

The orchestrator receives the user task, decomposes it into sub-tasks, delegates to specialized agents (with their own tools and memory), and consolidates results. Guardrails operate at each boundary. Human-in-the-loop is a checkpoint before high-risk action.

### 👤 User / Client

- User request (user)

### 🛡️ Entry Guardrail

- Entry Guardrail injection / scope check (security)

### 🧠 Orchestrator Agent

- Supervisor LLM task decomposition + result consolidation (ai)
- Task Router which sub-agent? (ai)

### 🔬 Specialist Sub-Agents

- Compliance Agent regulatory tools narrow context (ai)
- Finance Agent data / calc tools narrow context (ai)
- Document Agent Agentic RAG vector store (ai)

### 🧑‍💼 Human-in-the-Loop

- Human Checkpoint approve / reject high-risk action (user)

### 🛡️ Exit Guardrail

- Exit Guardrail output validation data leak check (security)

### Flows

- u2 -> g_entry: raw task
- g_entry -> orch: validated input
- orch -> router: decomposes
- router -> sa_compliance: delegate (parallel)
- router -> sa_finance: delegate (parallel)
- router -> sa_docs: delegate (parallel)
- sa_compliance -> orch: partial result
- sa_finance -> orch: partial result
- sa_docs -> orch: partial result
- orch -> hitl: checkpoint (risky action)
- hitl -> orch: approved / rejected
- orch -> g_exit: consolidated response
- g_exit -> u2: final response

## Guardrails, Security, and Human-in-the-Loop: architecture, not afterthought

The most common mistake I see in teams coming to AI agents is treating security as a layer added afterwards. That does not work. Guardrails need to be designed as part of the flow from the start, because they affect latency, cost, and tool design.

**Prompt injection** is the SQL injection of agents: external data (an email, a document, an API response) can contain instructions that hijack the agent's behavior. The defense is layered: (1) clearly separate data from instructions in the prompt (use explicit delimiters); (2) validate in the input guardrail whether data content contains instruction patterns; (3) apply the **principle of least privilege** to tools — an agent that only needs to read documents should not have a database write tool.

**Tool sprawl** is the anti-pattern of giving the agent 30 tools 'because it might need them'. The LLM must choose the right tool from all available ones — the more tools, the higher the probability of wrong selection and the larger the system prompt. Practical rule: if an agent has more than 15 tools, consider splitting into specialized sub-agents.

**Human-in-the-loop** is not a sign of architectural weakness — it is a business requirement in any system that executes irreversible actions (financial transfers, mass email sending, data deletion). The correct pattern is to model explicit checkpoints in the Plan-and-Execute flow: the plan is generated, a human approves, execution begins. This is also what Amazon Bedrock Agents supports natively with 'human approval steps'.

## Classic Agent Anti-Patterns — Symptom, Cause, and Remedy
| Criterion | Anti-Pattern | Symptom | Root cause | Remedy |
| --- | --- | --- | --- | --- |
| Premature multi-agent | High latency, impossible debugging | Complexity before real need | Start single-agent; add agents only with clear justification | — |
| Tool Sprawl | Agent picks wrong tool, random errors | More than 15-20 tools per agent | Split into specialized sub-agents with minimal tools | — |
| Memory without TTL | Growing cost, hallucinations from stale facts | No expiration policy on vector store | TTL per data type; re-index facts that change | — |
| Infinite loop | Timeout, exploding cost, no response | No step budget | Set max_iterations; explicit fallback to 'I do not know' | — |
| Polluted context | Inconsistent responses, high cost | Full history always in prompt | Summarization + session memory; inject only what is relevant | — |

## Where to start — Mental checklist for the architect

- ✅ Start with the simplest pattern that solves the problem: a ReAct agent with 3-5 tools solves most cases.
- ✅ Define the step budget (max_iterations) before going to production — without it, you are guaranteed an infinite loop.
- ✅ Model memory layers explicitly: what goes in the context window, what goes in session, what goes in the vector store — and what the TTL is for each.
- ✅ Input and output guardrails are mandatory in production — even if simple at first. Add prompt injection detection from day 1.
- ✅ Before adding a second agent, write in one sentence why it exists separately. If you cannot, do not add it.
- ✅ Identify the irreversible actions in your system and add human-in-the-loop checkpoints for each of them.

> **Architect's Perspective — For those making this transition:** If you come from distributed systems, the biggest trap when designing agents is the instinct to add complexity to solve reliability problems. In microservices, when something fails, you add retry, circuit breaker, saga. In agents, the answer is almost always: simplify the loop, reduce the tools, improve the prompt. Multi-agent is powerful, but it is the equivalent of microservices — you pay the coordination cost in latency, tokens, and operational complexity. I only add a second agent when I can measure the concrete benefit: latency reduction through parallelism, or measurable quality improvement through specialization. What convinced me to take guardrails seriously was not theory — it was watching an agent in staging get hijacked by a prompt injection hidden in the body of an email it was processing. Part 3 of this series will show how Bedrock AgentCore handles these problems with managed infrastructure, but the principles in this lesson apply to any stack.

## Verdict — What to take away

The pattern catalog in this lesson is not a list of equivalent options — it is a staircase of complexity. ReAct is step 1: it solves most problems, is auditable, and cheap to operate. Reflexion and Plan-and-Execute are steps 2 and 3: they add self-correction capability and human approval, but at the cost of tokens and latency. Multi-agent is step 4: only climb there when the previous steps fail for measurable reasons. Memory and guardrails are not steps — they are foundations that must be present from step 1. Part 3 of this series closes the loop by showing how Amazon Bedrock AgentCore implements these patterns in managed infrastructure, with native support for memory, guardrails, and human-in-the-loop.

## References

- [Anthropic — Building effective agents](https://www.anthropic.com/engineering/building-effective-agents)
- [AWS — Multi-agent orchestration on Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/agents-multi-agent-collaboration.html)
- [Yao et al. — ReAct: Synergizing Reasoning and Acting in Language Models (arXiv 2210.03629)](https://arxiv.org/abs/2210.03629)
- [Shinn et al. — Reflexion: Language Agents with Verbal Reinforcement Learning (arXiv 2303.11366)](https://arxiv.org/abs/2303.11366)
- [Gregor Hohpe — The Architecture Elevator (book)](https://architectureelevator.com/)
- [AWS — Amazon Bedrock Agents documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/agents.html)

## Case sources

- [Anthropic — Building effective agents](https://www.anthropic.com/engineering/building-effective-agents)
- [AWS — Multi-agent orchestration on Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/agents-multi-agent-collaboration.html)
- [Yao et al. — ReAct](https://arxiv.org/abs/2210.03629)
- [Shinn et al. — Reflexion](https://arxiv.org/abs/2303.11366)