Inside AI Agents (2/3): Architecture Pattern Catalog — from ReAct to Multi-Agent
Listen to study
generated on playGenerated only on first play
Powered by Amazon Polly + OmniVoice
The second lesson in the series maps the full catalog of AI agent architecture patterns: from single-agent loops (ReAct, Reflexion, Plan-and-Execute) to multi-agent orchestration, covering memory as an architecture decision, guardrails, and human-in-the-loop. The goal is to give the architect a precise vocabulary to choose — and justify — the right pattern for each problem, without falling into classic anti-patterns.
In Part 1 of this series you understood what an AI agent is: a loop that perceives, reasons, acts, and observes results — different from a simple prompt pipeline. Now comes the question that really matters for system designers: which pattern to use? ReAct? Plan-and-Execute? Multi-agent with a supervisor? The wrong answer costs money (tokens are expensive), latency (chained loops add up in seconds), and reliability (every hop is a failure point). This guide is the catalog I wish I had when I started designing agents for production: each pattern explained from scratch, with when to use it, when to avoid it, and the anti-patterns that show up every time.
What you will learn
Quick Glossary — Terms that appear in this lesson
- LLM
- Large Language Model — the language model that does the agent's core reasoning (e.g. Claude, GPT-4, Titan).
- Tool / Function
- External function the agent can call: REST API, SQL query, vector search, calculator, etc.
- Context Window
- Token limit the LLM can 'see' at once. Everything outside the window is invisible to the model — like process RAM.
- RAG
- Retrieval-Augmented Generation — fetching relevant documents and injecting them into the prompt before generating a response.
- Guardrail
- Validation layer that filters agent inputs/outputs — analogous to a WAF for LLMs.
- Prompt Injection
- Attack where external data (e.g. email content) contains malicious instructions that hijack the agent's behavior.
- Orchestrator / Supervisor
- Agent or process that breaks down tasks and delegates to specialized sub-agents.
- TTL (Time-to-Live)
- Expiry time for data in cache/memory. Without TTL, agent memory grows indefinitely.
The Fundamental Loop: why every pattern starts here
If you read Part 1, you know that an agent is essentially a loop: the LLM receives an observation, reasons, decides on an action, executes it, observes the result, and repeats. What differentiates architectural patterns is not the loop itself — it is how the reasoning is structured inside it and how many agents participate.
Think like a software developer. You have written a while with a stop condition. An agent is exactly that, but the stop condition is decided by the LLM ('is the task complete?'), and the loop body can call external tools. The engineering problem is: how do you structure the reasoning inside the loop so that it is reliable, auditable, and does not enter an infinite loop?
ReAct (Reasoning + Acting, Yao et al., 2022) was the first pattern to formalize this: the model explicitly alternates between a Thought step (natural language reasoning), an Action step (tool call), and an Observation step (tool result). This explicit alternation is powerful because it makes reasoning auditable — you can read the log and understand why the agent made each decision. It is the equivalent of writing code with inline comments instead of a monolithic block.
Diagram 1 — Single-Agent ReAct with Tools
A single LLM agent in a ReAct loop (Thought → Action → Observation). Tools are synchronous calls; the guardrail inspects input and output. The loop ends when the LLM emits 'Final Answer' or the step budget is exhausted.
- User · request
- Input Guardrail · prompt injection / PII filter
- Output Guardrail · toxicity / data leak filter
- LLM · Claude / GPT-4 / Titan
- Thought · reasoning step
- Action · tool selection + args
- Observation · tool result injected
- Vector Search · Agentic RAG
- External API · REST / GraphQL
- Calculator · deterministic fn
- Context Window · in-prompt (ephemeral)
- Session Memory · short-term store
Beyond ReAct: Reflexion, Plan-and-Execute, and Agentic RAG
Reflexion (Shinn et al., 2023) adds a second inner loop of self-critique: after each attempt, the agent generates a reflection on what went wrong and stores that reflection in session memory before trying again. It is like a developer who writes a test, sees it fail, notes the diagnosis, and then fixes it — instead of simply trying again in the dark. The cost is clear: more tokens per attempt. Use it when the task has a verifiable correctness criterion (e.g. code that must compile, SQL that must return a non-empty result).
Plan-and-Execute separates reasoning into two distinct phases: a Planner LLM generates a step-by-step plan (think of a Makefile), and an Executor LLM executes each step independently. The advantage is that the plan can be inspected and approved by a human before execution — excellent for high-risk flows. The disadvantage is rigidity: if the environment changes during execution, the plan may become stale.
Agentic RAG is the upgrade from static RAG that every architect should know. In traditional RAG, you always fetch documents before generating. In Agentic RAG, the agent decides when to fetch, formulates the search query as part of its reasoning, and can perform multiple iterative searches ('I did not find it, I will reformulate the query'). This reduces noise in the context — you do not inject irrelevant documents — but requires the agent to know when it does not know something, which is a model calibration problem.
Single-Agent Patterns — When to use each one
| Pattern | Code analogy | Best for | Main cost | Avoid when | |
|---|---|---|---|---|---|
| ReAct | while loop with log | Exploratory tasks with tools | Tokens per iteration | Task has a direct answer without tools | — |
| Reflexion | TDD with self-critique | Code generation, SQL, verifiable logic | 2-3x more tokens per attempt | No clear correctness criterion | — |
| Plan-and-Execute | Makefile + executor | Long flows with human approval | Planning latency + rigidity | Environment changes during execution | — |
| Tool Use / Function Calling | SDK with typed methods | Integration with external APIs | Tool I/O latency | More than 15-20 tools (tool sprawl) | — |
| Agentic RAG | Lazy loading of knowledge | Large, heterogeneous knowledge bases | Model calibration to know when to search | Small, stable knowledge base (static RAG suffices) | — |
Memory as Architecture: the problem nobody models correctly
Memory in agents is analogous to storage in microservices: you have RAM (fast, expensive, volatile), local disk (cheaper, session-persistent), and distributed database (slow, cheap, shared). Each layer has a different trade-off.
The context window is the agent's RAM: everything in the active prompt. It is the fastest memory, but has linear token cost — every extra token you inject into the context increases inference cost and latency. The classic mistake is to dump everything into the context window ('I will include the full conversation history') and discover that the API bill tripled.
Session memory (short-term) is the equivalent of a Redis cache per session: you persist the current conversation history outside the prompt and inject only a summary or the last N turns. This solves the cost problem, but requires a summarization strategy — and LLM-based summarization has a cost too.
Long-term memory — vector, episodic, semantic — is the distributed database. You store facts, user preferences, past episodes in a vector store (e.g. OpenSearch, pgvector) and retrieve by semantic similarity when relevant. The risk here is twofold: memory without TTL grows indefinitely (storage cost and search latency), and stale memory generates hallucinations — the agent 'remembers' a fact that has changed.
Diagram 2 — Agent Memory Layers
Three memory layers with increasing cost and latency. The agent chooses which layer to consult/write at each step of the ReAct loop. The 'summarization' arrow shows that the context window can be compressed into session memory.
- Agent LLM · ReAct loop
- Context Window · ~200k tokens max · ephemeral, costly
- Session Store · Redis / DynamoDB · TTL = session lifetime
- Summarizer · LLM compression
- Vector Store · OpenSearch / pgvector · semantic retrieval
- Episodic Store · structured facts · user preferences
- TTL Policy · stale-data eviction
Multi-Agent: when it truly helps and when it only adds complexity
The most important question about multi-agent is not 'which topology to use' — it is 'do I really need multi-agent?' The honest answer is: in most cases, no. A well-designed single agent with the right tools solves 80% of the problems people try to solve with complex multi-agent architectures.
Multi-agent makes sense in three real scenarios: (1) genuine parallelism — independent sub-tasks that can run simultaneously (e.g. analyzing 10 documents in parallel); (2) domain specialization — a sub-agent with a system prompt and tools optimized for a specific domain (e.g. a compliance agent vs. a financial analysis agent); (3) context isolation — you do not want the context of one sub-task to contaminate the reasoning of another.
The four main topologies are: Supervisor/Orchestrator (a central agent delegates to sub-agents and consolidates results — like a tech lead distributing tasks); Hierarchical (supervisor of supervisors — for very large problems); Swarm/Peers (agents without hierarchy that communicate via messages — more resilient, harder to debug); and Specialist Routing (a router LLM classifies the task and sends it to the correct specialist agent — analogous to an API Gateway with semantic routing).
The decision criterion I use in production: if you cannot explain in one sentence why each agent exists separately, you probably do not need multi-agent.
Diagram 3 — Supervisor → Specialist Sub-Agents Topology
The orchestrator receives the user task, decomposes it into sub-tasks, delegates to specialized agents (with their own tools and memory), and consolidates results. Guardrails operate at each boundary. Human-in-the-loop is a checkpoint before high-risk action.
- User · request
- Entry Guardrail · injection / scope check
- Supervisor LLM · task decomposition · + result consolidation
- Task Router · which sub-agent?
- Compliance Agent · regulatory tools · narrow context
- Finance Agent · data / calc tools · narrow context
- Document Agent · Agentic RAG · vector store
- Human Checkpoint · approve / reject · high-risk action
- Exit Guardrail · output validation · data leak check
Guardrails, Security, and Human-in-the-Loop: architecture, not afterthought
The most common mistake I see in teams coming to AI agents is treating security as a layer added afterwards. That does not work. Guardrails need to be designed as part of the flow from the start, because they affect latency, cost, and tool design.
Prompt injection is the SQL injection of agents: external data (an email, a document, an API response) can contain instructions that hijack the agent's behavior. The defense is layered: (1) clearly separate data from instructions in the prompt (use explicit delimiters); (2) validate in the input guardrail whether data content contains instruction patterns; (3) apply the principle of least privilege to tools — an agent that only needs to read documents should not have a database write tool.
Tool sprawl is the anti-pattern of giving the agent 30 tools 'because it might need them'. The LLM must choose the right tool from all available ones — the more tools, the higher the probability of wrong selection and the larger the system prompt. Practical rule: if an agent has more than 15 tools, consider splitting into specialized sub-agents.
Human-in-the-loop is not a sign of architectural weakness — it is a business requirement in any system that executes irreversible actions (financial transfers, mass email sending, data deletion). The correct pattern is to model explicit checkpoints in the Plan-and-Execute flow: the plan is generated, a human approves, execution begins. This is also what Amazon Bedrock Agents supports natively with 'human approval steps'.
Classic Agent Anti-Patterns — Symptom, Cause, and Remedy
| Anti-Pattern | Symptom | Root cause | Remedy | |
|---|---|---|---|---|
| Premature multi-agent | High latency, impossible debugging | Complexity before real need | Start single-agent; add agents only with clear justification | — |
| Tool Sprawl | Agent picks wrong tool, random errors | More than 15-20 tools per agent | Split into specialized sub-agents with minimal tools | — |
| Memory without TTL | Growing cost, hallucinations from stale facts | No expiration policy on vector store | TTL per data type; re-index facts that change | — |
| Infinite loop | Timeout, exploding cost, no response | No step budget | Set max_iterations; explicit fallback to 'I do not know' | — |
| Polluted context | Inconsistent responses, high cost | Full history always in prompt | Summarization + session memory; inject only what is relevant | — |
Where to start — Mental checklist for the architect
If you come from distributed systems, the biggest trap when designing agents is the instinct to add complexity to solve reliability problems. In microservices, when something fails, you add retry, circuit breaker, saga. In agents, the answer is almost always: simplify the loop, reduce the tools, improve the prompt. Multi-agent is powerful, but it is the equivalent of microservices — you pay the coordination cost in latency, tokens, and operational complexity. I only add a second agent when I can measure the concrete benefit: latency reduction through parallelism, or measurable quality improvement through specialization. What convinced me to take guardrails seriously was not theory — it was watching an agent in staging get hijacked by a prompt injection hidden in the body of an email it was processing. Part 3 of this series will show how Bedrock AgentCore handles these problems with managed infrastructure, but the principles in this lesson apply to any stack.
Verdict — What to take away
The pattern catalog in this lesson is not a list of equivalent options — it is a staircase of complexity. ReAct is step 1: it solves most problems, is auditable, and cheap to operate. Reflexion and Plan-and-Execute are steps 2 and 3: they add self-correction capability and human approval, but at the cost of tokens and latency. Multi-agent is step 4: only climb there when the previous steps fail for measurable reasons. Memory and guardrails are not steps — they are foundations that must be present from step 1. Part 3 of this series closes the loop by showing how Amazon Bedrock AgentCore implements these patterns in managed infrastructure, with native support for memory, guardrails, and human-in-the-loop.
References
Post-mortems, ADRs and architecture deep dives in your inbox — the way an architect reads them.
No spam · unsubscribe anytime
Ask Fernando about this
Get a focused answer about this study from my AI assistant, grounded in my work.
Join the conversation
Sign in to comment
Verify your email to join in — you'll also get the newsletter. No password.