# GPT-5 vs Claude vs Nova on Bedrock: A Production Governance Bake-off

With GPT-5.5 and Codex landing on Amazon Bedrock, platform teams now face a genuine choice between three frontier model families within the same control plane. This analysis compares GPT-5.5, Claude 3.7 Sonnet, and Amazon Nova Pro through the lens of teams shipping AI into regulated production environments.

- URL: https://fernando.moretes.com/blog/bedrock-2026-governanca-modelos-frontier

- Markdown: https://fernando.moretes.com/blog/bedrock-2026-governanca-modelos-frontier/article.md?lang=en

- Published: 2026-05-23T09:12:00.000Z

- Category: AI & Agents

- Tags: bedrock, gpt-5, claude, nova, ai-governance, llm-ops, financial-grade, aws

- Reading time: 7 min

- Source: [OpenAI GPT-5.5, GPT-5.4 and Codex on Amazon Bedrock](https://aws.amazon.com/blogs/aws/)

---

The arrival of GPT-5.5, GPT-5.4, and Codex on Amazon Bedrock is not just a product event; it is a signal that Bedrock is consolidating as the unified control plane for frontier models in enterprise environments. For teams operating in regulated sectors, the question has shifted from 'which model to use?' to 'how do we govern multiple frontier models with the same security, traceability, and cost controls we already apply to the rest of our AWS infrastructure?' This analysis runs that bake-off: GPT-5.5 vs Claude 3.7 Sonnet vs Amazon Nova Pro, judged on production behavior rather than benchmarks.

## What changed when GPT-5 landed on Bedrock

Before OpenAI models arrived on Bedrock, choosing GPT-4 or GPT-4o meant leaving the AWS perimeter: direct calls to the OpenAI API, secrets managed outside Secrets Manager, logs that bypassed CloudTrail, and data potentially leaving your residency region. For teams requiring LGPD, PCI-DSS, or SOC 2 compliance, that was a real governance cost, not a theoretical one.

With GPT-5.5 and Codex available via `bedrock:InvokeModel` and `bedrock:InvokeModelWithResponseStream`, the model becomes just another ARN resource. That means the IAM policies you already have — including conditions like `aws:RequestedRegion`, `bedrock:modelId`, and `aws:PrincipalTag` — apply directly. CloudTrail records every invocation. Amazon Bedrock Guardrails, with its content filters, PII detection, and grounding checks, covers GPT-5.5 the same way it covers Claude or Nova.
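
As a minimal sketch of what "just another ARN resource" can look like, the snippet below builds an IAM policy document that allows invocation of an approved model list in approved regions only. It scopes access through the statement's `Resource` ARNs plus the `aws:RequestedRegion` condition; the function name and the GPT-5.5 model ID are hypothetical placeholders, since the real identifier is not given here.

```python
import json

# Hypothetical placeholder: the real Bedrock model ID for GPT-5.5 is not stated in this article.
GPT55_MODEL_ID = "openai.gpt-5-5-PLACEHOLDER"


def build_invoke_policy(model_ids, allowed_regions):
    """Return an IAM policy document that allows invoking only the listed models."""
    regions = list(allowed_regions)
    # Foundation-model ARNs carry no account ID, hence the empty field between the colons.
    resources = [
        f"arn:aws:bedrock:{region}::foundation-model/{model_id}"
        for region in regions
        for model_id in model_ids
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "InvokeApprovedModelsOnly",
                "Effect": "Allow",
                "Action": [
                    "bedrock:InvokeModel",
                    "bedrock:InvokeModelWithResponseStream",
                ],
                "Resource": resources,
                # Rejects calls made against any region outside the approved list.
                "Condition": {"StringEquals": {"aws:RequestedRegion": regions}},
            }
        ],
    }


policy = build_invoke_policy([GPT55_MODEL_ID, "amazon.nova-pro-v1:0"], ["us-east-1"])
print(json.dumps(policy, indent=2))
```

Generating the document from a reviewed model list keeps the allow-list in version control, so adding a model becomes a pull request rather than a console change.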

What this does not solve: network latency to regions where the model is still served via cross-region endpoints, and the fact that GPT-5.5 weights do not reside in your account — you are consuming a hosted model, not a deployed one. For use cases requiring full inference isolation, such as document analysis with classified customer data, this remains a threat model item that needs explicit documentation.

## The dimension benchmarks miss: operational behavior

Academic benchmarks measure capability under controlled conditions. In financial production, what matters is behavior under load, p99 latency consistency, and the real cost of a response — not the average cost, but the cost of an 8k-token prompt with 2k output at peak hours.

Claude 3.7 Sonnet has a characteristic that matters greatly for agentic workflows: extended thinking mode produces chained reasoning that is auditable. In compliance contexts, being able to show the intermediate reasoning of a credit decision or fraud triage has direct regulatory value. GPT-5.5 also supports chain-of-thought, but the level of control over reasoning verbosity and the separation between scratchpad and final output is still less granular via the Bedrock API than what Anthropic exposes natively.

Amazon Nova Pro, on the other hand, is the only one of the three where you have full visibility into the model lifecycle within AWS. It supports fine-tuning via Bedrock Custom Model Jobs, meaning you can adapt the model to domain-specific vocabulary — derivatives terminology, for example — without relying on prompt engineering. Nova Pro's cost per token is significantly lower, which matters when you are processing millions of documents in batch with Bedrock Batch Inference.

The most common failure mode I see in production is not the model being wrong — it is the system lacking sufficient observability to know when the model was wrong. That leads directly to the instrumentation question.

## Unified Control Plane: Model Governance on Bedrock

Inference request flow through Bedrock governance layers, showing how GPT-5.5, Claude, and Nova share the same security and observability controls

### 🔐 AWS: Security and Identity

- IAM Policy bedrock:modelId condition (security)
- KMS Encrypt at rest / in transit (security)

### 🟧 Amazon Bedrock: Control Plane

- Bedrock Guardrails PII, content, grounding (security)
- Bedrock API Gateway InvokeModel / Stream (compute)
- Model Invocation Logging S3 + CloudTrail (data)

### 🤖 Frontier Models

- GPT-5.5 / Codex OpenAI via Bedrock (ai)
- Claude 3.7 Sonnet Extended Thinking (ai)
- Amazon Nova Pro Fine-tune + Batch (ai)

### 📊 Observability

- CloudWatch Latency P99 / Tokens (data)
- OpenTelemetry Trace ID per invocation (data)

### Flows

- client -> iam: authentication
- iam -> guardrails: policy enforced
- guardrails -> gateway: filtered prompt
- gateway -> gpt5: InvokeModel
- gateway -> claude: InvokeModel
- gateway -> nova: InvokeModel
- gateway -> logging: async log
- kms -> logging: encryption
- logging -> cw: metrics
- gateway -> otel: trace span

## Instrumentation: where most teams get it wrong

Bedrock emits native metrics to CloudWatch: `InvocationLatency`, `InputTokenCount`, `OutputTokenCount`, `InvocationClientErrors`, `InvocationThrottles`. But these metrics alone are insufficient to operate an AI system in financial production. What is missing is correlation between the model invocation and business context — which user, which product, which decision was influenced by that response.

The approach that works is instrumenting with OpenTelemetry at the application level, propagating a trace ID that crosses the Bedrock call and is included in the Model Invocation Logging payload. When you enable Model Invocation Logging with S3 + CloudWatch Logs as destination, each record includes the Bedrock `requestId`. If you inject that `requestId` as an attribute in your OTel span, you can correlate a customer complaint with the exact prompt and response that generated that decision — that is real auditability.
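
The correlation step can be sketched as a small helper that pulls the request ID out of a boto3-style response and merges it with business context into a flat attribute dictionary. The attribute names (`bedrock.request_id`, `app.*`) and the function name are illustrative assumptions, not an established convention; the response here is a stand-in dictionary rather than a live Bedrock call.

```python
def correlation_attributes(bedrock_response, business_context):
    """Build span attributes that link one Bedrock invocation to its business context."""
    # boto3 responses expose the service request ID under ResponseMetadata.RequestId.
    request_id = bedrock_response.get("ResponseMetadata", {}).get("RequestId")
    if not request_id:
        raise ValueError("response carries no RequestId; cannot correlate")
    attrs = {"bedrock.request_id": request_id}
    for key, value in business_context.items():
        # Stringify values so every attribute is a simple, exportable type.
        attrs[f"app.{key}"] = str(value)
    return attrs


# Stand-in for a response returned by a bedrock-runtime client.
fake_response = {"ResponseMetadata": {"RequestId": "req-123"}}
attrs = correlation_attributes(
    fake_response, {"product": "credit-card", "decision_id": 42}
)
# With the OpenTelemetry SDK installed, these could be attached to the
# active span via span.set_attributes(attrs).
print(attrs)
```

Keeping the helper free of SDK imports makes it easy to unit test; the span attachment stays a one-line call at the invocation site.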

For GPT-5.5 specifically, one watch point: the model supports `response_format: json_object` and structured outputs, but schema validation happens on the model side, not in Guardrails. If you need to guarantee that the response respects a specific schema before persisting to DynamoDB, add a validation step in the Lambda that processes the response — do not assume the model will always return valid JSON under load or with adversarial prompts.
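
A minimal sketch of that validation step, using only the standard library: parse the model's text, check required fields and types, and raise before anything is persisted. The schema, field names, and exception class are hypothetical examples for a fraud-triage decision; a production system might prefer a full JSON Schema validator.

```python
import json

# Hypothetical schema: field name -> required Python type.
DECISION_SCHEMA = {"decision": str, "risk_score": float, "reasons": list}


class InvalidModelOutput(ValueError):
    """Raised when the model's response cannot be trusted as structured data."""


def parse_decision(raw_text, schema=DECISION_SCHEMA):
    """Parse and validate model output; raise instead of returning bad data."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise InvalidModelOutput(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise InvalidModelOutput("top-level value must be a JSON object")
    for field, expected in schema.items():
        if field not in data:
            raise InvalidModelOutput(f"missing field: {field}")
        value = data[field]
        # Accept integers where floats are expected, but never booleans.
        if expected is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if not isinstance(value, expected):
            raise InvalidModelOutput(f"field {field!r} must be {expected.__name__}")
        data[field] = value
    return data


good = parse_decision('{"decision": "review", "risk_score": 1, "reasons": ["velocity"]}')
print(good)
```

In a Lambda handler, the `InvalidModelOutput` branch is where a retry, a fallback model, or a dead-letter queue would go, so malformed output never reaches DynamoDB.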

Claude 3.7 with extended thinking exposes the reasoning block as a separate field in the response. Store that field in S3 with a 7-year retention policy if you are in a regulated environment — it is decision-making evidence, not just a technical log.
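
The sketch below separates the reasoning block from the final answer and prepares the arguments for an S3 write with a retention date seven years out. It assumes a Converse-style response in which reasoning arrives as `reasoningContent` content blocks, and it assumes a bucket with S3 Object Lock enabled; the key layout and helper names are hypothetical.

```python
from datetime import datetime, timezone


def split_reasoning(converse_response):
    """Separate reasoning text from final answer text in a Converse-style response."""
    reasoning, answer = [], []
    for block in converse_response["output"]["message"]["content"]:
        if "reasoningContent" in block:
            text = block["reasoningContent"].get("reasoningText", {}).get("text")
            if text:
                reasoning.append(text)
        elif "text" in block:
            answer.append(block["text"])
    return "\n".join(reasoning), "\n".join(answer)


def retain_until(now, years=7):
    """Return the same calendar date `years` later, handling Feb 29."""
    try:
        return now.replace(year=now.year + years)
    except ValueError:  # Feb 29 mapped onto a non-leap target year
        return now.replace(year=now.year + years, day=28)


def evidence_put_kwargs(bucket, request_id, reasoning, now=None):
    """Build keyword arguments for an S3 put_object call with Object Lock retention."""
    now = now or datetime.now(timezone.utc)
    return {
        "Bucket": bucket,  # assumed to have Object Lock enabled
        "Key": f"reasoning/{request_id}.txt",  # hypothetical key layout
        "Body": reasoning.encode("utf-8"),
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": retain_until(now),
    }


sample = {"output": {"message": {"content": [
    {"reasoningContent": {"reasoningText": {"text": "Income covers 3x the installment."}}},
    {"text": "Approved."},
]}}}
reasoning, answer = split_reasoning(sample)
print(evidence_put_kwargs("audit-evidence-bucket", "req-123", reasoning)["Key"], answer)
```

Keying the object by the Bedrock request ID means the evidence file, the invocation log record, and the trace span all share one lookup value.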

## Real cost: beyond price per token

Frontier model cost comparisons frequently stop at input/output token price. That is the smallest component of total cost in production systems. The components that dominate cost are: (1) tokens wasted by poorly structured prompts, (2) retries due to throttling, and (3) the cost of operating the system around the model.

GPT-5.5 has a higher price per token than Claude 3.7 Sonnet and significantly higher than Nova Pro. For a document analysis workload processing 10 million pages per month with an average context of 4k tokens per page, the cost difference between GPT-5.5 and Nova Pro can be on the order of 5-8x. This is not an argument against using GPT-5.5 — it is an argument for using it selectively, in cases where its differentiated reasoning capability justifies the cost.
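
To make the scale concrete, here is the arithmetic as a short sketch. It counts input tokens only and takes the approximate Nova Pro input price of $0.8 per million tokens quoted in this article's comparison table; the 5x and 8x multipliers are the article's own range, not measured prices.

```python
PAGES_PER_MONTH = 10_000_000
TOKENS_PER_PAGE = 4_000
NOVA_INPUT_PRICE_PER_MTOK = 0.8  # USD, approximate figure from the comparison table


def monthly_input_cost(pages, tokens_per_page, price_per_mtok):
    """Monthly input-token cost in USD for a fixed-size document workload."""
    million_tokens = pages * tokens_per_page / 1_000_000
    return million_tokens * price_per_mtok


nova_cost = monthly_input_cost(PAGES_PER_MONTH, TOKENS_PER_PAGE, NOVA_INPUT_PRICE_PER_MTOK)
gpt_low, gpt_high = nova_cost * 5, nova_cost * 8

print(f"Nova Pro: ~${nova_cost:,.0f}/month")
print(f"GPT-5.5 at 5-8x: ~${gpt_low:,.0f} to ~${gpt_high:,.0f}/month")
```

Under these assumptions the workload is 40 billion input tokens a month: roughly $32,000 on Nova Pro against roughly $160,000 to $256,000 at the 5-8x multiple, before output tokens and retries are counted.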

Bedrock Batch Inference changes the calculation for async workloads. With batch, you get a discount of up to 50% on token price for Claude and Nova. GPT-5.5 on Bedrock does not yet support batch inference at the time of this analysis, so for large-scale processing you need to manage your own queue (SQS + Lambda with reserved concurrency) and handle the account-level TPM (tokens per minute) limits yourself.
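
One way to respect a TPM quota from a self-managed queue is a token bucket sized to the per-minute budget. The sketch below is an in-process illustration with hypothetical names; a fleet of Lambda workers would need shared state (for example a DynamoDB counter) rather than a local object.

```python
import time


class TokenBudget:
    """Token-bucket limiter for a tokens-per-minute (TPM) quota."""

    def __init__(self, tpm_limit, clock=time.monotonic):
        self.capacity = float(tpm_limit)
        self.available = float(tpm_limit)
        self.refill_per_sec = tpm_limit / 60.0
        self.clock = clock
        self.last = clock()

    def _refill(self):
        now = self.clock()
        gained = (now - self.last) * self.refill_per_sec
        self.available = min(self.capacity, self.available + gained)
        self.last = now

    def try_acquire(self, tokens):
        """Reserve `tokens` if the budget allows; False means requeue the message."""
        if tokens > self.capacity:
            raise ValueError("request exceeds the whole per-minute budget")
        self._refill()
        if tokens <= self.available:
            self.available -= tokens
            return True
        return False


# Deterministic demo with a fake clock instead of real time.
fake_now = [0.0]
budget = TokenBudget(60_000, clock=lambda: fake_now[0])
print(budget.try_acquire(50_000))  # budget is full, so this is admitted
print(budget.try_acquire(20_000))  # only 10k tokens left, so this is refused
fake_now[0] = 10.0                 # ten seconds later, 10k tokens have refilled
print(budget.try_acquire(20_000))  # now admitted
```

A refused request should go back to SQS with a visibility delay rather than being retried in a tight loop, which keeps throttling errors out of the Bedrock metrics.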

Bedrock TPM limits for third-party models like GPT-5.5 are managed via service quota, and increases require an AWS Support request. In multi-tenant environments where multiple products share the same AWS account, this can become a bottleneck. The solution is to use AWS Organizations with separate accounts per product and independent quotas — do not share TPM limits between critical and experimental workloads.

## GPT-5.5 vs Claude 3.7 Sonnet vs Amazon Nova Pro: Technical Comparison

| Criterion | GPT-5.5 (OpenAI via Bedrock) | Claude 3.7 Sonnet (Anthropic) | Amazon Nova Pro |
| --- | --- | --- | --- |
| Relative cost per token (input) | High (baseline ~$3/MTok) | Medium (~$3/MTok) | Low (~$0.8/MTok) |
| Batch Inference support (Bedrock) | No (at time of analysis) | Yes, up to 50% discount | Yes, up to 50% discount |
| Fine-tuning via Bedrock | Not available | Not available | Yes, via Custom Model Jobs |
| Auditable reasoning (structured CoT) | Partial, via structured outputs | Yes, separate extended thinking block | Partial, via prompt engineering |
| Bedrock Guardrails coverage | Yes, same controls | Yes, same controls | Yes, same controls |
| P50 latency (2k token prompt) | ~1.8s (estimate; varies by region) | ~1.5s (without extended thinking) | ~1.2s |
| Model weight residency in AWS account | No, hosted by OpenAI | No, hosted by Anthropic | Yes, Amazon-native |
| Codex / specialized code generation | Yes, Codex available on Bedrock | Strong: Claude 3.7 is top-tier for code | Competent, best with fine-tuning |

## Decision Matrix: Which Model for Which Workload?

### GPT-5.5 via Bedrock

**Pros**
- Top-tier reasoning capability for complex, ambiguous tasks
- Codex for code generation in AI-assisted DevOps pipelines
- Unified governance via IAM, CloudTrail, and Guardrails — no AWS perimeter exit
- Structured outputs with JSON schema for direct downstream system integration

**Cons**
- Higher cost per token; no batch inference support on Bedrock
- Weights do not reside in AWS account — implications for sensitive data threat models
- TPM limits managed via service quota; increases require AWS Support
- No fine-tuning available via Bedrock

**Verdict:** Best for: high-complexity reasoning tasks (legal analysis, due diligence), code generation in CI/CD pipelines, and cases where response quality justifies the premium cost.

### Claude 3.7 Sonnet

**Pros**
- Extended thinking with separate, auditable reasoning block — direct regulatory value
- Batch inference support with up to 50% discount for async workloads
- Excellent at code and technical analysis; consistent in long contexts
- Competitive pricing with GPT-5.5 for comparable quality on many tasks

**Cons**
- Extended thinking significantly increases latency — not suitable for real-time inference
- No fine-tuning via Bedrock; adaptation relies on prompt engineering and RAG
- Weights also do not reside in AWS account

**Verdict:** Best for: agentic workflows requiring auditable reasoning, regulatory document analysis, fraud triage with explainability, and high-quality batch processing.

### Amazon Nova Pro

**Pros**
- Lowest cost per token — 5-8x cheaper than GPT-5.5 for high-volume workloads
- Fine-tuning via Bedrock Custom Model Jobs — domain adaptation without prompt engineering
- Amazon-native weights; best posture for ultra-sensitive data threat models
- Batch inference support; lowest P50 latency among the three

**Cons**
- Reasoning capability below GPT-5.5 and Claude 3.7 on high-complexity tasks
- Fine-tuning requires quality dataset and MLOps pipeline — operational overhead
- Smaller third-party tool and integration ecosystem

**Verdict:** Best for: large-scale processing (millions of documents), classification and extraction tasks where fine-tuning pays off, and environments with stricter data sovereignty requirements.

> **The routing pattern that resolves the dilemma:** The answer to 'which model to use?' in mature production systems is not a single choice — it is a router. Implement an AI Gateway in Lambda or ECS that classifies each request by complexity, data sensitivity, and latency requirement, and routes to the appropriate model. Low-complexity, high-volume requests go to Nova Pro. Analyses requiring auditable reasoning go to Claude 3.7 with extended thinking. Code generation in CI/CD pipelines goes to Codex. Same Guardrails, same CloudTrail, same trace ID — unified governance with workload-optimized cost. This pattern reduces total cost by 40-60% compared to using GPT-5.5 for everything, without sacrificing quality where it matters.
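
The routing rules described above can be sketched as a small pure function. The request fields, the rule order, and the route labels are illustrative assumptions; the labels are not Bedrock model IDs, and a real gateway would likely use a classifier rather than hand-set flags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceRequest:
    task: str                # e.g. "codegen", "extraction", "analysis"
    complexity: str          # "low" or "high"
    needs_audit_trail: bool  # must the reasoning be kept as evidence?
    realtime: bool           # is a user waiting on the response?
    restricted_data: bool    # strictest data-sovereignty class?


def route(req):
    """Pick a route label for a request; the first matching rule wins."""
    if req.restricted_data:
        return "nova-pro"  # strictest data posture per the comparison above
    if req.task == "codegen":
        return "codex"
    if req.needs_audit_trail and not req.realtime:
        # Extended thinking adds latency, so it is reserved for async work.
        return "claude-3.7-sonnet+extended-thinking"
    if req.complexity == "low":
        return "nova-pro"
    if req.complexity == "high" and not req.needs_audit_trail:
        return "gpt-5.5"
    return "claude-3.7-sonnet"  # default model


print(route(InferenceRequest("extraction", "low", False, False, False)))
print(route(InferenceRequest("analysis", "high", True, False, False)))
print(route(InferenceRequest("codegen", "high", False, True, False)))
```

Because the function is deterministic, the routing decision itself can be logged alongside the trace ID, which makes "why did this request go to that model?" an answerable audit question.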

## Numbers that guide the routing decision

- **5-8x**: Cost per token difference between GPT-5.5 and Nova Pro. For high-volume workloads, intelligent routing is the largest cost lever available on Bedrock
- **50%**: Maximum discount with Batch Inference (Claude and Nova). Async workloads such as document analysis and data enrichment should use batch by default
- **7 years**: Recommended retention for reasoning logs in regulated environments. Claude 3.7's extended thinking block is decision-process evidence for regulatory audits

## Anti-patterns I encounter in production

- Using GPT-5.5 for all workloads because it is 'the most capable' — ignores that 70% of tasks do not need frontier reasoning and pays 5-8x more for it
- Not enabling Model Invocation Logging — without prompt/response logs, regulatory audit is impossible and quality regression debugging is blind
- Assuming Bedrock Guardrails replaces schema validation in code — Guardrails filters content, not data structure; invalid JSON still passes through
- Sharing TPM limits between critical and experimental workloads in the same AWS account — a token-burst experiment can throttle a production feature
- Not propagating trace IDs in Bedrock calls — loses the correlation between business decision and model invocation, making incident investigations much slower
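
The first anti-pattern can be put into numbers with a back-of-the-envelope sketch. It assumes every task consumes the same number of tokens, that the 70% of tasks not needing frontier reasoning all move to the cheaper model, and that the remaining 30% stay on GPT-5.5; the function name is illustrative.

```python
def routing_savings(cheap_share, price_ratio):
    """Fractional saving from moving `cheap_share` of traffic to a model
    that is `price_ratio` times cheaper, versus sending everything to the
    expensive model. Costs are expressed in units of the cheap model's price."""
    blended = cheap_share * 1.0 + (1.0 - cheap_share) * price_ratio
    return 1.0 - blended / price_ratio


for ratio in (5, 8):
    print(f"{ratio}x price gap: {routing_savings(0.70, ratio):.0%} saved")
```

Under these assumptions the saving works out to roughly 56% at a 5x price gap and 61% at 8x, which sits at the upper end of the 40-60% range quoted for the routing pattern; sending part of the traffic to a mid-priced model instead would pull the figure down.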

> **My curation note:** In practice, what I would do: start with Claude 3.7 Sonnet as the default model for any new workload in a financial environment — the auditable extended thinking is worth more than the cost difference versus Nova Pro when you are in a regulated sector. Introduce GPT-5.5 and Codex specifically for the AI-assisted DevOps pipeline, where code generation quality justifies the premium cost. Nova Pro would enter as a routing destination for large-scale classification and extraction, with fine-tuning trained on domain vocabulary. The lesson I learned the hard way: the biggest risk is not choosing the wrong model — it is not having sufficient observability to know when any model is wrong. Invest in trace IDs, Model Invocation Logging, and quality drift alerts before optimizing which model to use.

## Recommendation: do not choose a model, build a router

The arrival of GPT-5.5 and Codex on Bedrock does not make Claude or Nova obsolete — it completes the portfolio. The recommendation is clear: implement an AI Gateway that routes by workload, not by model preference. Use Claude 3.7 Sonnet as the default for tasks requiring auditable reasoning in regulated environments. Use GPT-5.5 and Codex for code generation and high-complexity reasoning tasks where the premium cost is justified by value. Use Nova Pro for large-scale processing and cases where domain fine-tuning is viable. In all cases: enable Model Invocation Logging on day one, propagate trace IDs, and treat TPM limits as an infrastructure quota requiring capacity planning — not a configuration detail. Unified governance on Bedrock is the real asset here; the models are commodities that will evolve. Build your platform around the controls, not around a specific model.

## References

- [Amazon Bedrock Model Invocation Logging](https://docs.aws.amazon.com/bedrock/latest/userguide/model-invocation-logging.html)
- [Amazon Bedrock Guardrails](https://docs.aws.amazon.com/bedrock/latest/userguide/guardrails.html)
- [Amazon Bedrock Batch Inference](https://docs.aws.amazon.com/bedrock/latest/userguide/batch-inference.html)
- [Amazon Bedrock Custom Model Fine-tuning](https://docs.aws.amazon.com/bedrock/latest/userguide/custom-models.html)
- [AWS Well-Architected — Machine Learning Lens](https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/welcome.html)
- [Amazon Bedrock Service Quotas](https://docs.aws.amazon.com/bedrock/latest/userguide/quotas.html)
- [OpenTelemetry for AWS Lambda and Bedrock](https://aws-otel.github.io/docs/getting-started/lambda)
- [AWS News Blog — Amazon Bedrock](https://aws.amazon.com/blogs/aws/)