# Playbook: GenAI FinOps — Cutting Agent Cost Without Losing Quality

GenAI agents in production accumulate cost in places standard dashboards don't surface: bloated input tokens, loops without stop criteria, silent retries, and expensive models on trivial tasks. This playbook presents six optimization levers — with execution order, impact table, and the anti-patterns that turn savings into quality regressions.

- URL: https://fernando.moretes.com/studies/playbook-finops-de-genai-cortar-custo-de-agente

- Markdown: https://fernando.moretes.com/studies/playbook-finops-de-genai-cortar-custo-de-agente/study.md?lang=en

- Type: Playbook

- Domain: IA / FinOps

- Date: 2026-03-25

- Tags: finops, genai, bedrock, agents, cost-optimization, llm, prompt-engineering, aws

- Reading time: 10 min

---

The CFO is now in the room. Your GenAI agent works — but the Bedrock bill grew faster than usage justifies, and nobody knows exactly why. The problem is rarely the model: it's the design around it. This playbook gives you the right levers, in the right order, without sacrificing quality.

## What you'll be able to decide after this playbook

- Where to instrument before any optimization (tokens per route, per model, per loop iteration)
- Which lever to attack first based on your agent's cost profile
- How to route simple tasks to cheaper models without quality regression
- When to use prompt caching and when it doesn't help
- Why silent retry is the most dangerous hidden cost in agents
- How structured output and stop rules cut cost without touching the model

## Playbook Context

- **Type:** Generic playbook — applicable to any agent on Amazon Bedrock
- **Domain:** GenAI FinOps / LLM cost optimization
- **Reference stack:** Amazon Bedrock (Agents, Prompt Caching, Batch Inference), CloudWatch, Cost Explorer
- **Models covered:** Claude 3 Haiku / Sonnet / Opus, Titan, Llama 3 — any model priced per input/output token
- **Critical premise:** You can't cut cost you don't measure. Instrumentation comes before optimization.
- **Savings numbers:** All percentages cited are [illustrative example] — not real case data

## The mental model that unlocks everything: you pay for every token that enters the loop

Most engineers think of an agent's cost as `(input tokens + output tokens) × price/token`. That's correct for a single call. But an agent doesn't make one call — it runs a **loop**: plan, call tool, observe result, re-plan, call tool again, generate final response. Each loop iteration receives as input the accumulated history of all previous iterations. The context grows. The cost grows quadratically with the number of iterations if you don't control what enters.

Add to this the **silent retry**: when the model returns malformed JSON, when a tool call fails, when parsing breaks — the orchestrator tries again. Without observability, every retry looks like a normal call in billing. In agents with long loops and unstable tools, retries can represent 20–40% of total cost [illustrative example].

The second blind spot is the **uniform model**: teams that ship an agent use Claude Sonnet (or equivalent mid-tier) at every step because that's what worked in the prototype. In production, 60–70% of calls are intent classification, field extraction, routing — tasks a Haiku-class model resolves with equivalent quality at a fraction of the cost [illustrative example]. Using the same model for complex reasoning and for 'what's the sentiment of this sentence?' is structural waste.

The premise of this playbook is simple: **quality almost never lives in the most expensive model — it lives in the design**. Well-constructed context, structured output, explicit stop criteria, and complexity-based routing deliver more result per dollar than upgrading Haiku to Opus across the board.

## Instrumentation: the non-negotiable prerequisite

Before touching any optimization lever, you need to know **where the money is going**. Amazon Bedrock exposes token metrics via CloudWatch (`InputTokenCount`, `OutputTokenCount` per model ID), but by default they have no route or loop-iteration dimension. You need to add that.

**What to instrument, at minimum:**

1. **Tokens per business route** — not just 'how much the agent consumed', but 'how much the customer support route consumed vs. the document analysis route'. This reveals which flow is out of control.

2. **Tokens per loop iteration** — structured log with `step_index`, `model_id`, `input_tokens`, `output_tokens`, `tool_called`, `is_retry`. Without this, you can't distinguish legitimate cost from retry cost.

3. **Accumulated context tokens** — the size of the context in the input of each call. If this number grows monotonically throughout the loop, you have a context window problem that will explode in production.

4. **Retry rate per tool** — how many times each tool call failed and was re-executed. High-failure-rate tools are cost multipliers.

**Where to emit:** CloudWatch Embedded Metric Format (EMF) is the simplest option in the AWS ecosystem — you emit structured JSON from Lambda/ECS and it becomes a metric and log simultaneously, with no extra agent. For ad-hoc analysis, push the same events to S3 + Athena.

Only after having this baseline for 5–7 days of real traffic do you have enough data to prioritize the levers below. Optimizing without measuring is trading one problem for another.

## The 6 optimization levers — in recommended execution order

1. **Lever 1 — Instrumentation (prerequisite)** — Emit tokens per route, per iteration, per model via CloudWatch EMF. Add `is_retry: true` to every call that results from a parsing failure or tool error. Create a dashboard with: (a) cost per route/day, (b) average iterations per session, (c) retry rate per tool. Only advance to the next levers after 5–7 days of baseline. **Testable:** you can answer 'which route cost the most yesterday and how many iterations did it average?'

2. **Lever 2 — Model routing (highest ROI, highest risk if done poorly)** — Classify each loop step by complexity: (a) classification/extraction/routing → lightweight model (e.g., Claude 3 Haiku, Titan Lite); (b) multi-step reasoning, long synthesis, complex code generation → mid/high model (e.g., Claude 3 Sonnet/Opus). Implement a simple router: before each call, a lightweight classifier (can be the cheap model itself) decides which model to use. **Practical rule:** if the task has an expected response under 50 tokens and doesn't require chained reasoning, use the cheap model. **Risk:** don't test only on the happy path — test with ambiguous, multilingual, typo-ridden inputs. Quality regression on edge cases is the main problem of poorly calibrated routing.

3. **Lever 3 — Prompt caching** — Bedrock Prompt Caching allows marking prompt prefixes (system prompt, context documents, few-shot examples) for reuse across calls. Cached tokens cost less per read than normally processed tokens. **When to use:** long system prompts (>1000 tokens) repeated across many requests; fixed reference documents (manuals, policies) injected in every context; static few-shot examples. **When NOT to use:** highly dynamic contexts where the prefix changes every call (cache never hits); low-volume sessions where managing TTL overhead isn't worth it. **Implementation:** mark the stable prefix with `cachePoint` in the Bedrock API. Monitor `cacheReadInputTokenCount` vs `cacheWriteInputTokenCount` — if read/write ratio < 2, the cache isn't paying off.

4. **Lever 4 — Lean context (you pay for input tokens — don't push the full history)** — In agents with multiple iterations, conversation history is injected as context in each call. If you inject the full history, input cost grows O(n²) with the number of turns. **Strategies:** (a) **Relevance-based truncation:** keep only the N most recent turns + the current plan; (b) **Progressive summarization:** every K iterations, call the cheap model to summarize history and replace the K turns with the summary; (c) **Compact tool results:** when a tool returns a large JSON, extract only the relevant fields before injecting into context — don't inject the raw payload. **Caution:** naive truncation (cutting the first N tokens) can remove critical context. Always preserve: the original user instruction, the current plan, and the last tool result.

5. **Lever 5 — Stop rules and fewer iterations** — Agents without explicit stop criteria 'overthink': the model keeps planning and calling tools even when it already has enough information to respond. This is pure cost. **Implement:** (a) `max_iterations` hardcoded per task type (not a global value — a field extraction task has max 3 iterations; a deep research task can have 10); (b) **Confidence-based stop condition:** if the model returns a `confidence` field in structured output > threshold, immediately break the loop; (c) **Early exit by tool result:** if a tool returns a definitive result (e.g., record found in DB), don't continue the loop — respond directly. **Anti-pattern:** high global `max_iterations` as 'safety' — that's an invitation for expensive loops in production.

6. **Lever 6 — Structured output + batch/async** — **Structured output:** when the agent needs to return structured data (JSON, field list), enforce the schema via `response_format` or via explicit prompt instruction. Unstructured output that fails parsing becomes a retry — and retry is duplicated cost. Use JSON Schema when the model supports it natively (Bedrock Converse API); otherwise, include the schema in the system prompt with an example. **Batch inference:** for async workloads (nightly reports, queued document processing, batch analysis), use Bedrock Batch Inference — the cost per token is significantly lower than on-demand [verify current pricing at aws.amazon.com/bedrock/pricing]. The trade-off is latency: batch has an SLA of hours, not seconds.

## Before vs. After per lever — impact, quality risk, effort
| Criterion | Lever | Where it cuts cost | Potential savings [example] | Quality risk | Implementation effort |
| --- | --- | --- | --- | --- | --- |
| Instrumentation | Visibility — doesn't cut directly | — | None | Low (1–3 days) | None |
| Model routing | Cost per call on simple tasks | 30–60% on routed routes [example] | High if poorly calibrated — test edge cases | Medium (1–2 weeks + tests) | Quality baseline per route |
| Prompt caching | Repeated input tokens (system prompt, docs) | 20–50% on prefix tokens [example] | Low — doesn't change logic | Low (hours) | Stable system prompt, sufficient volume |
| Lean context | Input tokens per loop iteration | 20–40% on input cost in long loops [example] | Medium — wrong truncation loses critical context | Medium (1 week) | Map of which history fields are essential |
| Stop rules | Unnecessary loop iterations | 10–30% in agents with long loops [example] | Low if max_iterations per task type | Low (days) | Iteration distribution per task type |
| Structured output | Retries from parsing failure | Eliminates parsing retry cost [can be 5–20% of total, example] | Low — improves consistency | Low (hours–days) | Product-defined schema |
| Batch/async | Cost per token in workloads without real-time SLA | Significant discount vs. on-demand [check current pricing] | None — same model quality | Medium (async architecture) | Workload without latency SLA |

## The hidden cost: silent retry and why it's the most dangerous

In production agents, silent retry is the cost multiplier nobody sees until they look at the logs carefully. It happens in three main forms:

**1. Retry from parsing failure:** the model returns free text when you expected JSON, or returns JSON with missing fields. The orchestrator detects the error, builds a new prompt asking the model to 'try again in the correct format', and makes another full call. Each such retry costs the same as an original call — and you pay twice for the same result (when it works on the second attempt).

**2. Retry from tool error:** an external tool (API, database, internal service) fails with a timeout or 5xx error. The agent tries again, sometimes with the same input, sometimes with a variation. If the tool is unstable, this becomes a retry loop that consumes tokens without producing value.

**3. Retry from instruction ambiguity:** the model interprets the instruction differently than expected, produces a result that downstream code rejects, and the orchestrator re-injects the instruction with more context. This is the most expensive because the retry prompt is larger than the original.

The solution for (1) is structured output with explicit schema. The solution for (2) is circuit breaker on external tools + retry with backoff at the tool level, not at the agent loop level. The solution for (3) is more precise prompt engineering + instruction testing before going to production.

What makes silent retry dangerous isn't just the cost — it's that it masks quality problems. If your agent has a 15% retry rate and you don't know it, you're paying 15% more and believing the system is more reliable than it is.

## Agent cost flow — where each lever cuts

Diagram shows the internal loop of a Bedrock agent with cost points and where each optimization lever acts. Input tokens grow with accumulated history; retries duplicate calls; unnecessary iterations multiply everything.

### 👤 Entrada do usuário

- Usuário User request (user)

### 🔀 Roteador / Router

- Roteador de Modelo [Alavanca 2] Model Router (compute)

### 🧠 Loop do Agente / Agent Loop

- Orquestrador Orchestrator (compute)
- Context Builder [Alavanca 4: contexto enxuto] [Lever 4: lean context] (compute)
- Chamada ao Modelo Model Call 💰 INPUT TOKENS 💰 OUTPUT TOKENS (ai)
- Stop Rule Check [Alavanca 5] max_iterations / confidence (compute)
- Output Parser [Alavanca 6: structured output] JSON Schema validation (compute)

### 🔧 Ferramentas / Tools

- Tool Executor API / DB / Service (external)
- Circuit Breaker (evita retry de tool) (prevents tool retry) (compute)

### 💾 Cache / Cache

- Prompt Cache [Alavanca 3] Bedrock cachePoint 💰 cache read < input token (storage)

### 📦 Batch / Async

- Fila Assíncrona Async Queue [Alavanca 6: batch] SQS / EventBridge (messaging)
- Bedrock Batch Inference 💰 menor custo/token 💰 lower cost/token (ai)

### 📊 Observabilidade / Observability

- CloudWatch EMF [Alavanca 1] tokens/rota, tokens/iteração is_retry flag (data)
- Dashboard Custo/rota, P95 iterações taxa de retry (data)

### Flows

- user -> router: request
- router -> orchestrator: selected model
- orchestrator -> context_builder: history + plan
- context_builder -> prompt_cache: static prefix
- context_builder -> model_call: lean context
- model_call -> stop_check: response + confidence
- stop_check -> output_parser: continue
- stop_check -> user: early exit
- output_parser -> tool_exec: tool call
- output_parser -> orchestrator: retry (parsing fail) ⚠️
- tool_exec -> circuit_breaker: result / error
- circuit_breaker -> orchestrator: tool result
- circuit_breaker -> orchestrator: retry tool ⚠️
- model_call -> emf: tokens + is_retry
- emf -> dashboard
- batch_queue -> batch_inference: async workload

> **Anti-patterns that will bite you in production:** **1. Optimizing without measuring first.** Swapping the model without a quality baseline is the fastest path to a silent regression. You'll cut cost, the CFO will be happy, and three weeks later someone will notice the resolution rate dropped. Measure before, measure after, compare rigorously.

**2. Swapping to a worse model on tasks that seem simple but aren't.** 'Intent classification' seems trivial until you discover your domain has 40 intents, some with high semantic overlap, and the cheap model misclassifies 8% of cases — and those 8% are exactly the high-value cases. Test with real data from your domain, not generic benchmarks.

**3. Prompt caching on dynamic context.** If your system prompt changes per user, per session, or per feature flag, the cache will never hit and you'll pay the write overhead with no benefit. Cache only works when the prefix is truly static and volume is high enough to amortize the write.

**4. High global `max_iterations` as 'safety'.** A value like `max_iterations=20` in production isn't safety — it's a blank check for pathological loops. Define per task type, with conservative values, and actively monitor the iteration distribution.

**5. Ignoring output token cost in reasoning models.** Models with 'thinking' or 'extended reasoning' (e.g., Claude with thinking enabled) generate internal reasoning tokens that are billed. For simple tasks, disable thinking — you're paying for a reasoning process the task doesn't need.

> **Rule of thumb:** **If you can only do one thing: instrument tokens per route and per iteration before any optimization.** An agent's cost is almost never where you think it is. Without data, you'll optimize the wrong part and won't know if it worked. With data, the right levers become obvious within 48 hours of analysis.

> **My perspective — what I actually do in practice:** When I receive an agent with a cost problem, the first thing I do is ask for the raw logs from the last 48 hours with `input_tokens`, `output_tokens`, `model_id`, `step_index`, and `is_retry`. In 80% of cases, the problem is one of three things: (a) a 3000-token system prompt being injected at every iteration without caching; (b) raw 10k-token tool results being pushed into the full context; or (c) parsing retry rate above 10% because the prompt doesn't specify the output schema.

Model routing is the highest-ROI lever, but also the one I see done wrong most often. Teams route based on 'this step seems simple', without testing with real domain data. My approach: before routing any step to the cheap model, I create a set of 100 representative cases (including domain edge cases), run both models, and only route if the quality difference is < product-defined threshold. This takes 2–3 days but prevents the silent regression that shows up 3 weeks later.

On prompt caching: it's the easiest lever to implement and the most underestimated. If you have a long, static system prompt, enabling cache in Bedrock is literally adding one field in the API and monitoring the hit rate. The only care is ensuring the prefix marked as cache truly doesn't change between requests — if it does, you pay write without gaining read.

One thing I learned in financial systems: hidden cost is hidden risk. Silent retry isn't just expensive — it's a signal that the system is failing in ways you're not seeing. Instrumenting retry is instrumenting reliability.

## AWS Well-Architected Alignment

- **security**: When truncating history and summarizing context, ensure sensitive data doesn't persist in summaries beyond what's necessary. Review token log retention policies — they may contain business data.
- **reliability**: Circuit breaker on external tools and structured output reduce both cost and silent failures. Stop rules with max_iterations per task type prevent infinite loops in production.
- **performance**: Lean context reduces latency beyond cost — fewer input tokens = lower processing time. Prompt caching reduces latency on first iterations with long system prompts.

## References

- [Amazon Bedrock — Pricing](https://aws.amazon.com/bedrock/pricing/)
- [Amazon Bedrock — Prompt Caching (User Guide)](https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html)
- [Amazon Bedrock — Batch Inference (User Guide)](https://docs.aws.amazon.com/bedrock/latest/userguide/batch-inference.html)
- [AWS Well-Architected — Cost Optimization Pillar](https://docs.aws.amazon.com/wellarchitected/latest/cost-optimization-pillar/welcome.html)

## Verdict

GenAI FinOps isn't about using the cheapest model — it's about not wasting tokens on poor design. The right model for the right task, surgical context, cache where the prefix is stable, stop rules that respect task complexity, and structured output that eliminates retries: these six levers, applied with data, deliver more result per dollar than any blind model swap. The most dangerous hidden cost isn't the token you see on the bill — it's the silent retry that tells you the system is failing in ways you haven't measured yet.

## Case sources

- [Amazon Bedrock — Pricing](https://aws.amazon.com/bedrock/pricing/)
- [Amazon Bedrock — Prompt caching](https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html)
- [Amazon Bedrock — Batch inference](https://docs.aws.amazon.com/bedrock/latest/userguide/batch-inference.html)
- [AWS — Cost Optimization Pillar (Well-Architected)](https://docs.aws.amazon.com/wellarchitected/latest/cost-optimization-pillar/welcome.html)