# ADR: Model Selection on Amazon Bedrock — Claude vs Nova vs Llama vs Fine-tune

Architectural decision on which foundation model to adopt for an enterprise GenAI feature on Amazon Bedrock, evaluating quality, cost per token, latency, data governance, pt-BR support, and context window. The conclusion is a task-type routing strategy — no single model wins across all axes.

- URL: https://fernando.moretes.com/studies/adr-selecao-de-modelo-no-bedrock

- Markdown: https://fernando.moretes.com/studies/adr-selecao-de-modelo-no-bedrock/study.md?lang=en

- Type: Decision (ADR)

- Company: Feature GenAI enterprise (cenário)

- Domain: IA

- Status: proposed

- Date: 2026-02-25

- Tags: bedrock, llm, model-selection, genai, aws, cost-optimization, pt-BR, adr

- Reading time: 9 min

---

Choosing a language model for production is not a benchmark decision — it is an engineering decision with direct consequences on operational cost, perceived latency, data compliance, and product evolution capacity. In this ADR, I evaluate four model families available on Amazon Bedrock for an enterprise GenAI feature with multilingual requirements (pt-BR mandatory), estimated volume of tens of millions of tokens per month, and data restrictions that prevent sending PII to external APIs without explicit contractual control. The final recommendation is not a single model: it is a task-type routing strategy.

## Scenario Context

- **System:** Enterprise GenAI feature (composite scenario, representative of real cases)
- **Domain:** Artificial Intelligence / Natural Language Processing
- **AI Platform:** Amazon Bedrock (managed inference, no GPU management)
- **Models evaluated:** Claude 3.5 Sonnet / Haiku (Anthropic), Amazon Nova Pro / Lite / Micro, Llama 3.x (Meta), own fine-tune via Bedrock Custom Model
- **Language requirements:** pt-BR mandatory; en-US and es-419 desirable
- **Estimated volume:** ~50–150 M tokens/month (project estimate; varies by case)
- **Data restrictions:** Customer PII; LGPD applicable; data cannot leave AWS region without DPA
- **Primary AWS region:** us-east-1 (Bedrock GA); sa-east-1 as data region
- **Task types:** Long document summarization, structured extraction (JSON), draft generation, intent classification, Q&A over knowledge base

## Forces at Play

Every production model decision involves at least five simultaneous tensions that rarely point in the same direction.

**Quality vs. cost per token.** Frontier models like Claude 3.5 Sonnet deliver reasoning quality and textual coherence in pt-BR that smaller models simply cannot achieve in zero-shot. But the price difference between Sonnet and a cost-optimized model like Nova Micro can be two orders of magnitude. For binary classification or simple field extraction tasks, paying for Sonnet is measurable waste.

**Latency vs. context window.** Long documents — contracts, reports, call transcripts — require context windows of 100k+ tokens. Claude 3.5 Sonnet supports 200k token context. Smaller models frequently have 8k–32k windows, forcing chunking and additional retrieval logic that introduces latency and operational complexity. The model choice directly affects the ingestion architecture.

**Governance and data residency.** In Bedrock, inference occurs in the configured AWS region and Anthropic, Meta, and Amazon have signed terms of non-use of customer data for training. This resolves the basic contractual problem. However, fine-tuning with proprietary data requires that data to be stored in S3 and processed by Bedrock's customization service — which expands the audit surface and requires additional access controls (IAM, KMS, VPC endpoints).

**Multilingual pt-BR.** This is a force frequently underestimated in evaluations done with English benchmarks. Claude 3.x and Nova Pro have documented performance in pt-BR. Llama 3.1/3.2 significantly improved multilingual support, but still shows perceptible degradation in free generation tasks in Portuguese compared to English. Fine-tuning on pt-BR corpus can compensate for this gap, but creates a maintenance cycle.

**Product evolution speed.** Managed models (Claude, Nova) receive updates without engineering team intervention. An own fine-tune freezes the base behavior and requires periodic re-training to keep up with model improvements — an MLOps cost that needs to be in the denominator of the decision.

## Model Comparison on Amazon Bedrock (pricing reference: on-demand, us-east-1, 2025)
| Criterion | Model | Input ($/1M tokens) | Output ($/1M tokens) | Max context | pt-BR |
| --- | --- | --- | --- | --- | --- |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 200k | Excellent | Medium-high (~2–5s TTFT) |
| Claude 3.5 Haiku | $0.80 | $4.00 | 200k | Good | Low (~0.5–1.5s TTFT) |
| Amazon Nova Pro | $0.80 | $3.20 | 300k | Good | Medium (~1–3s TTFT) |
| Amazon Nova Lite | $0.06 | $0.24 | 300k | Reasonable | Low (~0.3–1s TTFT) |
| Amazon Nova Micro | $0.035 | $0.140 | 128k | Limited | Very low (<0.5s TTFT) |
| Llama 3.1 70B Instruct | $0.99 | $0.99 | 128k | Reasonable (degradation in free generation) | Medium (~1–3s TTFT) |
| Own fine-tune (base: Nova/Llama) | Variable + training cost (estimate: $500–5k/run) | Variable | Inherited from base model | High (if pt-BR corpus is used) | Depends on base model |

## Why No Single Model Solves Everything

The natural temptation in any GenAI project is to choose the best available model and use it for everything. This eliminates routing complexity, simplifies prompts, and reduces test surface. The problem is that this approach optimizes for the worst-case cost.

Consider the task mix in this scenario. Intent classification — determining whether a message is a complaint, an information request, or praise — is a task that an 8B parameter model with a good prompt resolves with F1 above 0.90. Using Claude 3.5 Sonnet for this is like using a scalpel to cut bread. Structured field extraction from standardized documents (Brazilian tax IDs, monetary values) is equally solvable with smaller models and well-constructed prompts.

On the other hand, summarization of long legal documents in pt-BR with preservation of technical nuances, or generation of communication drafts that need to sound natural and formal at the same time, are tasks where the difference between Sonnet and Haiku is perceptible to a human reviewer. Here, saving on the model means saving on product quality.

The context window adds another dimension. Nova Pro supports 300k tokens — larger than Sonnet — at an input cost of $0.80/1M tokens versus Sonnet's $3.00/1M. For summarization of very long documents where the full context needs to be available, Nova Pro is a serious alternative that needs to be evaluated with real pt-BR data before being dismissed.

The argument for Llama on Bedrock is different: it is not primarily cost, it is weight auditability. Regulated sectors (financial, healthcare) sometimes need to demonstrate that they understand the model they are using. Open-weight does not mean you will re-train — it means you can inspect, that there is academic literature on the model's behavior, and that vendor dependency is structurally lower. This argument carries real weight in compliance committees.

## Options Evaluated

### Option A — Claude 3.5 Sonnet for everything

**Pros**
- Maximum quality in reasoning and generation in pt-BR
- 200k token context; zero chunking for most documents
- Simple architecture: one model, one endpoint, one prompt set
- Lower regression risk from model changes

**Cons**
- Cost 40–85x higher than Nova Micro for simple tasks
- Higher TTFT impacts UX in low-latency features
- Total dependency on single vendor (Anthropic via AWS)

**Verdict:** Viable for MVPs and low volumes; unsustainable at scale with simple task mix

### Option B — Amazon Nova Pro for everything

**Pros**
- 300k token context — largest in category
- Input cost ~73% lower than Sonnet; output ~79% lower
- AWS native: data residency, frictionless IAM/KMS integration
- Roadmap aligned with AWS services (Agents, Knowledge Bases)

**Cons**
- Free generation quality in pt-BR inferior to Sonnet in complex tasks (empirical evaluation needed)
- Newer model: less behavioral literature available
- Still oversized for classification and simple extraction

**Verdict:** Strong candidate for medium-high complexity tasks; requires pt-BR benchmark before replacing Sonnet

### Option C — Llama 3.1 70B for lower-cost tasks

**Pros**
- Open-weight: auditable, no weight lock-in
- Flat input/output price ($0.99/1M) — predictable for output-heavy workloads
- Large community; external fine-tuning possible if needed

**Cons**
- pt-BR with perceptible degradation in free generation vs. models trained with more Portuguese data
- 128k context — chunking needed for very long documents
- Higher cost than Nova Lite/Micro for simple tasks

**Verdict:** Best justification in regulated contexts requiring weight auditability; not the standard cost-benefit choice

### Option D — Own fine-tune on Bedrock

**Pros**
- Domain-specialized behavior: terminology, tone, output format
- Token reduction potential via shorter prompts (implicit behavior)
- Full control over training corpus

**Cons**
- Training + evaluation + versioning cost and time (real MLOps)
- Base model freezes: frontier improvements not automatically inherited
- Expanded audit surface: training data in S3, customization pipeline, weight versioning
- Overfitting risk in narrow domain; reduced generalization

**Verdict:** Justified only when there is a measurable gap that no managed model resolves and sufficient volume to amortize MLOps cost

## Architectural Decision

**Status:** proposed

**Context**

Enterprise GenAI feature with heterogeneous task mix (simple classification to complex summarization), mandatory pt-BR requirement, LGPD restrictions, and volume that makes inference cost a relevant budget item. No single model offers the best cost-benefit across all task types.

**Decision**

Adopt task-type routing with three primary routes on Amazon Bedrock: (1) Claude 3.5 Sonnet for high-complexity tasks — long document summarization, draft generation requiring editorial quality, multi-step reasoning; (2) Amazon Nova Pro or Claude 3.5 Haiku for medium-complexity tasks — Q&A over knowledge base, structured extraction from non-standardized documents, conversational response generation; (3) Amazon Nova Lite or Nova Micro for low-complexity tasks — intent classification, standardized field extraction, message triage and routing. Own fine-tune is explicitly discarded at this time due to absence of a measurable gap that justifies the MLOps cost. The decision between Nova Pro and Claude 3.5 Haiku in the intermediate route must be resolved by empirical benchmark in pt-BR with real domain data before go-live.

**Consequences**
- POSITIVE: Estimated 60–75% reduction in inference cost vs. exclusive Sonnet use for 100M tokens/month volume (project estimate based on typical task mix)
- POSITIVE: Quality preserved in tasks where it impacts the product; no perceptible degradation for the end user
- POSITIVE: Routing architecture allows model swap per route without global refactoring — independent evolution
- NEGATIVE: Increased operational complexity: three prompt sets, three model configurations, mandatory per-route observability
- NEGATIVE: Task classification logic is a new component that needs to be tested, versioned, and monitored
- NEUTRAL: Empirical pt-BR benchmark on intermediate route is a go-live prerequisite — adds ~2 weeks to schedule

## Data Governance and LGPD in the Bedrock Context

A question that frequently paralyzes GenAI decisions in Brazilian companies is LGPD compliance when customer data transits through third-party models. Amazon Bedrock structurally resolves part of this problem, but not all of it.

**What Bedrock guarantees by design:** Inference occurs within AWS infrastructure in the configured region. AWS, Anthropic, and Meta have signed contractual terms (available in Bedrock service terms) that customer data sent for inference is not used to train or improve base models. This is different from using the public Anthropic or OpenAI API directly, where standard terms may permit use of data for service improvement. For LGPD purposes, Bedrock qualifies as a data operator with DPA (Data Processing Agreement) available via AWS.

**What still needs to be managed:** Even with Bedrock, PII that enters the prompt needs to be handled. The recommendation is to implement a sanitization layer before the model call — replacing CPF, CNPJ, names, and sensitive data with tokens or pseudonyms when the task does not require real values (e.g., intent classification does not need the real CPF; contract summarization can pseudonymize parties). This reduces residual risk and simplifies auditing.

**Fine-tune and the training data problem:** If the fine-tune decision is revisited in the future, training data stored in S3 for the Bedrock customization job becomes a sensitive data asset with its own lifecycle. Retention, encryption with KMS customer-managed key, access control with IAM least-privilege, and auditing via CloudTrail are obligations — not optional. The compliance cost of fine-tuning goes beyond the compute cost of training.

**Region and latency:** Bedrock with full Claude and Nova support is available primarily in us-east-1 and us-west-2. For data that cannot leave Brazil, the architecture needs a sanitization layer in sa-east-1 before routing to us-east-1 for inference — or wait for model availability in sa-east-1, which is expanding. This region restriction is a roadmap item that needs to be on the architecture team's radar.

## Task-Type Routing Architecture on Amazon Bedrock

Inference flow with intelligent routing: the request enters, is sanitized, classified by task complexity, and routed to the appropriate model on Bedrock. Automatic fallback guarantees minimum quality.

### 👤 Client Layer

- Client App (Web/Mobile/API) (user)

### 🔒 API & Security Layer (sa-east-1)

- API Gateway + WAF (edge)
- PII Sanitizer (Lambda) Macie scan (security)
- Cognito / IAM AuthZ (security)

### 🧠 Routing Layer (Lambda / ECS — us-east-1)

- Task Router (Lambda) Classifica complexidade (compute)
- Intent Classifier Nova Micro (bootstrap) (ai)
- Fallback Logic Conf. threshold < 0.85 → escala (compute)

### 🤖 Amazon Bedrock — Model Routes (us-east-1)

- Claude 3.5 Sonnet Rota Alta Complexidade 200k ctx | $3/$15 per 1M (ai)
- Nova Pro / Haiku Rota Média Complexidade 300k ctx | $0.80/$3.20 (ai)
- Nova Lite / Micro Rota Baixa Complexidade 300k ctx | $0.06/$0.24 (ai)

### 📊 Observability & Data (sa-east-1 / us-east-1)

- CloudWatch Métricas por rota Custo por tarefa (data)
- S3 + Athena Audit logs Prompt traces (storage)
- Knowledge Base Bedrock KB (OpenSearch) (data)

### Flows

- client -> apigw: HTTPS request
- apigw -> authz: AuthN/AuthZ
- apigw -> sanitizer: Sanitize PII
- sanitizer -> router: Clean payload
- router -> classifier: Classify task
- classifier -> fallback: Confidence score
- fallback -> sonnet: High complexity
- fallback -> haiku_nova: Medium complexity
- fallback -> nova_lite: Low complexity
- haiku_nova -> kb: RAG lookup
- sonnet -> kb: RAG lookup
- router -> cloudwatch: Route metrics
- sonnet -> s3_logs: Audit trace
- haiku_nova -> s3_logs
- nova_lite -> s3_logs

## AWS Well-Architected Framework Assessment

- **security**: PII sanitized before reaching the model; IAM least-privilege per route; KMS CMK for logs and S3 data; VPC endpoints for Bedrock eliminate public internet traffic; CloudTrail enabled for all Bedrock API calls.
- **reliability**: Confidence threshold fallback prevents silent degradation; retry with jitter on Bedrock calls; circuit breaker recommended for high-frequency routes; Knowledge Base with OpenSearch in multiple AZs.
- **sustainability**: Using smaller models for simple tasks reduces computational consumption; avoiding context over-provisioning (not sending 200k tokens when 10k suffices) is both a cost and energy optimization.

> **My Senior Take:** The most common mistake I see in enterprise GenAI projects is not choosing the wrong model — it is not having a routing strategy from the start. Teams choose the best available model for the MVP (usually Sonnet or GPT-4), the product works well, and when volume scales, inference cost becomes a CFO problem, not an engineering one. Then the pressure to switch models is enormous, the prompts were written for the expensive model, and the migration is painful.

My practical recommendation: start with routing from the first week, even if initially all routes point to the same model. The routing abstraction costs nothing in complexity at the start and is worth a lot when you need to swap. It is the same principle as writing code against an interface, not against an implementation.

On the Nova vs. Claude Haiku debate for the intermediate route: I would not decide on paper. Bedrock has a Model Evaluation API that allows running a set of prompts from your domain and comparing outputs with defined criteria. For pt-BR specifically, I would build a golden dataset with 50–100 representative domain examples, run both models, and evaluate with a combination of automatic metrics (ROUGE for summarization, exact match for extraction) and human evaluation for free generation. This decision should be empirical, not based on general English benchmarks.

On fine-tuning: the right question is not 'will fine-tuning improve quality?' — the answer is almost always yes. The right question is 'does the quality gain justify the recurring MLOps cost?' In most cases I have seen, the answer is no — especially when the base model evolves rapidly. Fine-tuning makes sense when you have very specific domain vocabulary, a rigid output format that prompt engineering cannot guarantee, or when token reduction via implicit behavior has material financial impact. Outside these cases, it is complexity without proportional return.

Finally: monitor cost per task type from day one. This is not just financial optimization — it is the most honest signal that your routing is working.

## Verdict

The model decision on Amazon Bedrock is not a binary choice — it is a portfolio strategy. For an enterprise GenAI feature with a heterogeneous task mix and pt-BR requirement, the task-type routing architecture is the only approach that sustainably balances quality, cost, and governance at scale.

Claude 3.5 Sonnet remains the quality reference for complex tasks in Portuguese and should be the default choice when generation quality is critical to the product. Amazon Nova Pro and Claude 3.5 Haiku compete directly in the intermediate range — the decision between them should be empirical, with benchmark on real domain data in pt-BR. Amazon Nova Lite and Micro are the obvious choices for classification and simple extraction tasks, where the quality gain from larger models is not perceptible to the end user.

Own fine-tuning is an MLOps decision, not a model quality decision — and should be treated as such. The recurring maintenance cost of a custom model rarely justifies itself when managed models evolve at the current pace. Revisit this decision when there is a measurable gap and sufficient volume for amortization.

The operational complexity of routing is real but manageable: routing abstraction as an interface, versioned prompts, per-route observability, and a pt-BR evaluation golden dataset are the four artifacts that transform this architecture from an experiment into a reliable production system.

## References

- [Amazon Bedrock — Supported Foundation Models](https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html)
- [Amazon Bedrock — Pricing](https://aws.amazon.com/bedrock/pricing/)
- [Amazon Nova Models — AWS Documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/amazon-nova.html)
- [Anthropic Claude on Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-anthropic-claude-messages.html)
- [Amazon Bedrock — Model Evaluation](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation.html)
- [Amazon Bedrock — Custom Model Fine-tuning](https://docs.aws.amazon.com/bedrock/latest/userguide/custom-models.html)
- [Amazon Bedrock — Data Protection and Privacy](https://docs.aws.amazon.com/bedrock/latest/userguide/data-protection.html)
- [Meta Llama 3 on Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-meta.html)

## Case sources

- [Amazon Bedrock — Supported foundation models](https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html)
- [Amazon Bedrock — Pricing](https://aws.amazon.com/bedrock/pricing/)