# ADR: Self-Hosted LLM (EKS + vLLM) vs Managed API (Bedrock)

This ADR evaluates the decision between self-hosting large language models on EKS with vLLM and consuming models via Amazon Bedrock. The analysis covers cost per token, MLOps operational burden, cold start, data residency, and the token-volume break-even point.

- URL: https://fernando.moretes.com/studies/adr-llm-self-hosted-eks-vs-bedrock

- Markdown: https://fernando.moretes.com/studies/adr-llm-self-hosted-eks-vs-bedrock/study.md?lang=en

- Type: Decision (ADR)

- Company: AI Platform (scenario)

- Domain: AI / Cost

- Status: accepted

- Date: 2026-02-28

- Tags: llm, bedrock, vllm, eks, gpu, mlops, cost-optimization, ai-platform

---

Self-hosting your own LLM on dedicated GPUs looks attractive on paper — full control, low marginal cost at scale, no API vendor lock-in. In practice, you are trading a billing problem for an engineering problem. This ADR documents the decision made for a growing AI platform: when Bedrock is the right call, and at what point self-operated vLLM on EKS starts making economic and operational sense.

## Fact Sheet

- **System:** AI Platform — composite scenario based on real adoption patterns
- **Domain:** Generative AI / Cost Optimization
- **Models considered:** Claude 3 Sonnet/Haiku (Bedrock), Llama 3 70B / Mistral 7B (vLLM self-hosted)
- **GPU hardware evaluated:** g5.12xlarge (4× A10G, 96 GB VRAM) and p5.48xlarge (8× H100, 640 GB VRAM)
- **Token volume (baseline):** ~500 M tokens/month at initial phase; projection of 5–10 B tokens/month in 12 months
- **Compliance requirements:** LGPD; PII data must not leave AWS sa-east-1 region
- **Orchestration stack:** Amazon EKS 1.30, Karpenter, NVIDIA Device Plugin, vLLM 0.5.x, Prometheus + Grafana
- **Decision status:** Accepted — Bedrock for initial phase; migration to vLLM/EKS when volume exceeds break-even

## Context and Forces at Play

The platform started as an internal content automation and document analysis product, but quickly gained traction as a multi-tenant service offered to enterprise clients. Within six months, inference volume jumped from tens of thousands to hundreds of millions of tokens per month. At that point, the Bedrock invoice started appearing meaningfully in the P&L and the inevitable question arose: *is it worth bringing the model in-house?*

The central tension is not technical — it is economic and operational. On one side, Bedrock offers zero infrastructure overhead, frontier models (Claude, Titan, Llama via Marketplace), managed SLA, and per-token billing with no minimum commitment in on-demand mode. On the other, dedicated GPU instances have high fixed cost but near-zero marginal cost per token after amortization — which inverts the cost equation at high volumes.

There are, however, forces beyond cost. The engineering team has two experienced MLEs, but neither has deep operational experience running GPU clusters in production. This matters: vLLM is an excellent inference engine (PagedAttention, continuous batching, tensor parallelism), but it demands constant attention to CUDA driver versions, model compatibility with quantization schemes (AWQ, GPTQ), tuning of parameters such as `--max-num-seqs` and `--gpu-memory-utilization`, and OOM management on large models. The learning-curve cost is real and does not appear in any break-even spreadsheet.
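To make the tuning surface concrete, here is a minimal sketch that collects the flags named above into one reviewable object and renders them as server arguments. The class name, the model ID, and the default values for `max_num_seqs` and `max_model_len` are assumptions for illustration, not settings taken from this platform.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class VllmServerConfig:
    """Subset of vLLM engine flags discussed in this ADR (hypothetical wrapper)."""

    model: str                            # hypothetical model ID or local path
    quantization: Optional[str] = "awq"   # "awq" or "gptq"; None for unquantized weights
    tensor_parallel_size: int = 4         # one shard per A10G on a g5.12xlarge
    gpu_memory_utilization: float = 0.90  # fraction of each GPU's VRAM vLLM may claim
    max_num_seqs: int = 64                # assumed starting point; tune under real load
    max_model_len: int = 8192             # assumed starting point; caps context length

    def to_args(self) -> List[str]:
        """Render the config as CLI flags for the vLLM OpenAI-compatible server."""
        if not 0.0 < self.gpu_memory_utilization <= 1.0:
            raise ValueError("gpu_memory_utilization must be in (0, 1]")
        if self.tensor_parallel_size < 1:
            raise ValueError("tensor_parallel_size must be >= 1")
        args = [
            "--model", self.model,
            "--tensor-parallel-size", str(self.tensor_parallel_size),
            "--gpu-memory-utilization", str(self.gpu_memory_utilization),
            "--max-num-seqs", str(self.max_num_seqs),
            "--max-model-len", str(self.max_model_len),
        ]
        if self.quantization is not None:
            args += ["--quantization", self.quantization]
        return args


if __name__ == "__main__":
    cfg = VllmServerConfig(model="example-org/llama-3-70b-awq")  # placeholder ID
    print("python -m vllm.entrypoints.openai.api_server " + " ".join(cfg.to_args()))
```

Keeping these values in a typed, versioned object rather than scattered across manifests makes each tuning change reviewable, which matters when a single flag can move the cluster from stable to OOM.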

Another relevant force is load heterogeneity. The platform has pronounced daytime peaks (peak-to-valley ratio of approximately 8:1) and nightly batch workloads. Bedrock absorbs peaks without manual sizing; EKS with Karpenter can scale out GPU nodes, but the cold start of a g5.12xlarge node — including container image pull with quantized model (~20–40 GB) — can take 8–15 minutes, which is unacceptable for interactive latency.

## Break-Even Analysis by Token Volume

The break-even analysis is the quantitative core of this decision. The numbers below are estimates based on public AWS prices (us-east-1, 2024 reference) and vLLM throughput benchmarks — they must be recalibrated with current prices and real load profile before any production decision.

**Bedrock (Claude 3 Haiku — on-demand):**
- Input: $0.00025 / 1K tokens; Output: $0.00125 / 1K tokens
- Assuming 70% input / 30% output mix: weighted average cost ≈ $0.000550 / 1K tokens
- At 1 B tokens/month: ~$550/month
- At 10 B tokens/month: ~$5,500/month
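The blended Bedrock rate above is a weighted average, so it is easy to recompute when prices or the input/output mix change. The sketch below uses only the figures from this list; the function names are illustrative.

```python
def blended_price_per_1k(input_price: float, output_price: float, input_share: float) -> float:
    """Weighted average price per 1K tokens for a given input/output mix."""
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    return input_share * input_price + (1.0 - input_share) * output_price


def monthly_cost_usd(tokens_per_month: float, price_per_1k: float) -> float:
    """Monthly spend for a token volume at a per-1K-token price."""
    return tokens_per_month / 1_000 * price_per_1k


if __name__ == "__main__":
    # Claude 3 Haiku on-demand prices and the 70/30 mix quoted in this section.
    haiku = blended_price_per_1k(0.00025, 0.00125, input_share=0.70)
    for volume in (500e6, 1e9, 10e9):
        print(f"{volume / 1e9:>5.1f} B tokens -> ${monthly_cost_usd(volume, haiku):,.0f}/month")
```

The mix matters: output tokens cost five times as much as input tokens here, so a workload that generates long responses will sit well above the 70/30 blended rate.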

**Self-hosted vLLM (Llama 3 70B on g5.12xlarge):**
- g5.12xlarge on-demand: ~$5.67/hour → ~$4,082/month (720 h)
- Typical throughput with vLLM + PagedAttention + AWQ int4: ~1,500–2,500 tokens/s under sustained load (estimate based on public vLLM benchmarks)
- At 2,000 tokens/s with 60% utilization: ~3.1 B tokens/month per instance (2,000 × 0.60 × 720 h × 3,600 s)
- Effective cost: ~$4,082 / 3.1 B ≈ $0.0013 / 1K tokens, more than twice the Haiku blended rate
- With 1-year Reserved Instance (~40% discount): ~$2,449/month → ~$0.00079 / 1K tokens, closer to Haiku but still above it at this utilization
- With Spot + Reserved mix and >80% utilization: cost can drop to ~$0.0003–0.0004 / 1K tokens
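The self-hosted figures follow from two quantities: how many tokens an instance can serve in a month, and what the instance costs in that month. The sketch below reproduces that arithmetic so it can be rerun with measured throughput. It assumes a 720-hour month, which is what $5.67/hour → ~$4,082 implies; the function names are illustrative.

```python
HOURS_PER_MONTH = 720  # assumption: 30-day month, matching $5.67/h -> ~$4,082


def monthly_instance_cost(hourly_usd: float, discount: float = 0.0) -> float:
    """Monthly instance cost after an optional fractional discount (0.40 = 40% off)."""
    return hourly_usd * HOURS_PER_MONTH * (1.0 - discount)


def monthly_capacity_tokens(tokens_per_second: float, utilization: float) -> float:
    """Tokens served per month at a sustained throughput and average utilization."""
    return tokens_per_second * utilization * HOURS_PER_MONTH * 3600


def cost_per_1k_tokens(monthly_cost_usd: float, monthly_tokens: float) -> float:
    """Effective cost per 1K tokens."""
    return monthly_cost_usd / monthly_tokens * 1_000


if __name__ == "__main__":
    capacity = monthly_capacity_tokens(tokens_per_second=2_000, utilization=0.60)
    on_demand = monthly_instance_cost(5.67)
    reserved = monthly_instance_cost(5.67, discount=0.40)
    print(f"capacity:  {capacity / 1e9:.2f} B tokens/month")
    print(f"on-demand: ${cost_per_1k_tokens(on_demand, capacity):.5f} / 1K tokens")
    print(f"reserved:  ${cost_per_1k_tokens(reserved, capacity):.5f} / 1K tokens")
```

Utilization is the dominant lever: halving it doubles the effective cost per token, while the instance bill stays the same.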

**Practical break-even (estimate):**
- Comparing Bedrock Haiku on-demand vs. g5.12xlarge Reserved + high utilization: break-even around **3–5 B tokens/month per instance**
- For larger models (Claude 3 Sonnet vs. Llama 3 70B full precision on p5): break-even rises significantly due to hardware cost
- Real break-even must include: MLOps engineering cost (~0.5–1 FTE), observability licenses, data egress, and downtime cost from operational incidents

**Quantitative conclusion:** below ~3 B tokens/month, Bedrock is economically superior even ignoring operational overhead. Above 5–8 B tokens/month with consistently >70% utilization, self-hosting starts to make sense — but only if the team has the operational maturity to sustain the environment.
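A break-even estimate has to respect the fact that GPU cost grows in whole instances while Bedrock cost grows linearly. The sketch below scans monthly volumes and returns the first one at which self-hosting is cheaper. It is a simplification under stated assumptions: a single blended Bedrock price, identical instances, and an optional flat `fte_monthly_usd` term whose value is left to the reader because this ADR gives it only as 0.5–1 FTE.

```python
import math
from typing import Optional


def bedrock_monthly_usd(tokens: int, price_per_1k: float) -> float:
    """Linear API cost for a monthly token volume."""
    return tokens / 1_000 * price_per_1k


def self_hosted_monthly_usd(tokens: int, capacity_per_instance: int,
                            instance_monthly_usd: float,
                            fte_monthly_usd: float = 0.0) -> float:
    """Step-function GPU cost: whole instances plus a flat engineering cost."""
    instances = max(1, math.ceil(tokens / capacity_per_instance))
    return instances * instance_monthly_usd + fte_monthly_usd


def break_even_tokens(price_per_1k: float, capacity_per_instance: int,
                      instance_monthly_usd: float, fte_monthly_usd: float = 0.0,
                      step: int = 100_000_000,
                      max_tokens: int = 50_000_000_000) -> Optional[int]:
    """Smallest scanned volume where self-hosting is cheaper, or None if none is found."""
    for tokens in range(step, max_tokens + 1, step):
        self_cost = self_hosted_monthly_usd(tokens, capacity_per_instance,
                                            instance_monthly_usd, fte_monthly_usd)
        if self_cost < bedrock_monthly_usd(tokens, price_per_1k):
            return tokens
    return None


if __name__ == "__main__":
    # Assumed inputs: 2,500 tokens/s at 80% utilization over 720 h, Reserved pricing.
    capacity = int(2_500 * 0.80 * 720 * 3600)
    result = break_even_tokens(0.00055, capacity, instance_monthly_usd=2_449)
    print("no break-even in range" if result is None else f"{result / 1e9:.1f} B tokens/month")
```

With those inputs and no FTE cost, the scan lands at about 4.5 B tokens/month, inside the 3–5 B range quoted above. Any non-zero `fte_monthly_usd` pushes the result up sharply, because each fully loaded instance saves only a few hundred dollars per month against Haiku.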

## Decision Matrix: Options Evaluated

### Option A: Amazon Bedrock On-Demand

**Pros**
- Zero GPU infrastructure overhead — no drivers, no OOM, no CUDA upgrades
- Immediate access to frontier models (Claude 3.5, Llama 3.1 405B) without infra fine-tuning work
- Elastic billing — pay exactly for what you use, ideal for unpredictable volume
- AWS-managed SLA; no infra on-call burden for the product team
- Data processed within AWS region (sa-east-1 available for some models)

**Cons**
- High per-token cost at large volumes — economies of scale do not transfer to the customer
- No control over exact model version — updates may change behavior without notice
- Provisioned Throughput discounts require a 1-month or 6-month commitment, billed regardless of actual utilization
- Custom open-weight models (proprietary fine-tuning) have limited support

**Verdict:** Correct choice for initial phase (<3 B tokens/month) and for teams without GPU MLOps maturity

### Option B: EKS + vLLM on Dedicated GPUs (g5/p5)

**Pros**
- Very low marginal cost per token at high utilization — favorable break-even above 5 B tokens/month
- Full control over model version, quantization, context configuration, and fine-tuning
- vLLM offers continuous batching and PagedAttention — superior throughput to naive solutions
- Data never leaves the VPC — maximum compliance for PII and sensitive data
- Ability to use Spot Instances for batch workloads with additional 60–70% savings

**Cons**
- High fixed cost regardless of utilization — idle GPU instance still costs ~$4K/month
- GPU node cold start: 8–15 minutes for new Karpenter node with large model image
- Significant operational burden: driver upgrades, CUDA/PyTorch/vLLM compatibility, OOM tuning
- Model update requires new deployment rollout — not transparent like Bedrock
- Requires dedicated MLOps expertise — hidden cost of ~0.5–1 FTE for stable operation

**Verdict:** Justified above 5–8 B tokens/month with mature MLOps team and predictable load

### Option C: Hybrid — Bedrock + vLLM by Workload Type

**Pros**
- Bedrock for interactive peaks and frontier models; vLLM for predictable batch and open-weight models
- Optimizes cost per load segment without compromising interactive latency
- Allows building GPU operational maturity gradually without full production risk

**Cons**
- Routing complexity — requires abstraction layer (LLM gateway) to direct requests
- Two stacks to operate, monitor, and maintain — doubled observability overhead
- Behavioral consistency between different models can be difficult to guarantee

**Verdict:** Recommended evolutionary path for growing platforms above 2 B tokens/month

## Decision

**Status:** accepted

**Context**

The platform is in a growth phase with ~500 M tokens/month, a team of 2 MLEs without production GPU operational experience, LGPD data residency requirement in sa-east-1, and a projection of 5–10 B tokens/month in 12 months. Current Bedrock cost is manageable but the growth trajectory will make the invoice significant. The team needs time to build operational maturity before taking on the responsibility of a production GPU cluster.

**Decision**

Adopt Amazon Bedrock as the primary inference layer for the current phase (0–3 B tokens/month), with Provisioned Throughput for predictable base loads and on-demand for peaks. In parallel, build and validate the EKS + vLLM stack in a staging environment with real batch load. Migrate batch workloads to vLLM/EKS when: (a) batch volume consistently exceeds 2 B tokens/month, (b) the team completes 90 days of cluster operation in staging without critical incidents, and (c) projected Bedrock cost exceeds total EKS cost of ownership (including MLOps FTE) by a margin >30%. Maintain Bedrock as fallback and for frontier models without open-weight equivalent.
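The three migration triggers are objective enough to encode as a check that can run in a monthly review. The sketch below is one possible reading: "consistently" is interpreted as every month in the supplied window, which is an assumption, and the type and field names are illustrative.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class MigrationInputs:
    batch_tokens_by_month: Sequence[float]  # recent monthly batch volumes, in tokens
    staging_days_without_critical: int      # days of clean staging operation
    projected_bedrock_usd: float            # projected monthly Bedrock cost
    eks_tco_usd: float                      # monthly EKS TCO, including MLOps FTE


def should_migrate_batch(m: MigrationInputs) -> bool:
    """True only when criteria (a), (b), and (c) from the decision all hold."""
    volume_ok = bool(m.batch_tokens_by_month) and all(
        v > 2e9 for v in m.batch_tokens_by_month
    )                                                       # (a) > 2 B tokens/month
    staging_ok = m.staging_days_without_critical >= 90      # (b) 90 clean days
    cost_ok = m.projected_bedrock_usd > 1.30 * m.eks_tco_usd  # (c) margin > 30%
    return volume_ok and staging_ok and cost_ok


if __name__ == "__main__":
    review = MigrationInputs(
        batch_tokens_by_month=[2.4e9, 2.6e9, 3.1e9],  # hypothetical figures
        staging_days_without_critical=95,
        projected_bedrock_usd=14_000,
        eks_tco_usd=10_000,
    )
    print("migrate batch workloads:", should_migrate_batch(review))
```

Requiring all three conditions together is the point of the decision: volume alone does not justify migration if the team has not yet proven it can run the cluster.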

**Consequences**
- POSITIVE: Zero operational risk in initial phase — team focuses on product, not GPU infra
- POSITIVE: Clear and measurable migration path with objective trigger criteria
- POSITIVE: Provisioned Throughput reduces cost by ~30–40% for predictable base load vs on-demand
- NEGATIVE: Higher per-token cost than necessary if volume grows faster than projected
- NEGATIVE: Dependency on models and versions controlled by AWS/Anthropic — risk of breaking changes
- NEGATIVE: Requires LLM gateway from the start to abstract provider and facilitate future migration

## MLOps Operational Burden: The Cost That Doesn't Show Up in the Spreadsheet

Every break-even analysis I've seen in team discussions about LLM self-hosting makes the same mistake: it compares instance cost with API cost and stops there. This systematically underestimates the real cost of operating GPUs in production.

Let's be concrete about what it means to maintain vLLM on EKS in production. First, the compatibility chain: vLLM has strict dependencies between CUDA version, PyTorch version, and the NVIDIA driver on the host. An EKS node AMI upgrade can silently break inference if the driver is not compatible with the CUDA version compiled into the container. This is not hypothetical — it is a recurring pattern in projects I have followed. The NVIDIA Device Plugin for Kubernetes adds another configuration layer that needs to be managed.

Second, vLLM tuning is non-trivial. The parameters `--gpu-memory-utilization` (typically 0.85–0.90), `--max-num-seqs` (maximum number of in-flight sequences), `--max-model-len` (maximum context size), and the choice of quantization (AWQ vs. GPTQ vs. FP8) directly affect throughput, P99 latency, and OOM risk. A Llama 3 70B model in FP16 requires ~140 GB of VRAM for weights alone, so it does not fit in a single g5.12xlarge (96 GB). With AWQ int4, the weights drop to ~35–40 GB and fit comfortably, though still not on a single 24 GB A10G: the model has to be sharded across the instance's GPUs with tensor parallelism. Generation quality may also degrade on specific tasks, so the quantization choice needs to be empirically validated for each use case.
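The VRAM figures in this paragraph come from a simple rule of thumb: parameter count times bytes per parameter. The sketch below applies it and checks the result against the memory vLLM is allowed to claim. It covers weights only and ignores KV cache, activations, and quantization metadata, which is why real AWQ footprints land somewhat above the raw 35 GB.

```python
def weight_vram_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate VRAM for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


def weights_fit(weights_gb: float, total_vram_gb: float,
                gpu_memory_utilization: float = 0.90) -> bool:
    """Whether weights fit in the VRAM budget; leaves no guarantee about KV-cache room."""
    return weights_gb < total_vram_gb * gpu_memory_utilization


if __name__ == "__main__":
    G5_12XLARGE_VRAM_GB = 96  # 4 x A10G at 24 GB each
    for label, bits in (("FP16", 16), ("AWQ int4", 4)):
        gb = weight_vram_gb(70, bits)
        verdict = "fits" if weights_fit(gb, G5_12XLARGE_VRAM_GB) else "does not fit"
        print(f"Llama 3 70B {label}: ~{gb:.0f} GB of weights, {verdict} on g5.12xlarge")
```

Whatever VRAM remains after the weights is what vLLM can use for the KV cache, so a model that barely fits will serve far fewer concurrent sequences than one with headroom.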

Third, cold start management is a product problem, not just an infra problem. Karpenter can provision a new GPU node in response to a pending pod, but the total time — EC2 provisioning + node bootstrap + container image pull + model loading into VRAM — is on the order of 8–15 minutes for large models. For interactive workloads, this is unacceptable. The solution is to maintain a minimum number of warm nodes (increasing fixed cost) or use node image caching (EKS Bottlerocket + EBS snapshot with pre-pulled image). Both approaches have additional cost and complexity.
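The cold-start budget can be approximated by summing its phases, which also shows where caching helps most. In the sketch below every phase duration and the effective pull bandwidth are hypothetical placeholders to be replaced with measurements; only the image size range and the hourly price come from this document.

```python
HOURS_PER_MONTH = 720  # assumption: 30-day month


def image_pull_seconds(image_gb: float, effective_gbps: float) -> float:
    """Time to pull an image of image_gb gigabytes at an effective rate in Gbit/s."""
    return image_gb * 8 / effective_gbps


def cold_start_seconds(provision_s: float, bootstrap_s: float, image_gb: float,
                       effective_gbps: float, model_load_s: float) -> float:
    """EC2 provisioning + node bootstrap + image pull + model load into VRAM."""
    return (provision_s + bootstrap_s
            + image_pull_seconds(image_gb, effective_gbps) + model_load_s)


def warm_pool_monthly_usd(nodes: int, hourly_usd: float) -> float:
    """Fixed monthly cost of keeping nodes warm to avoid cold starts."""
    return nodes * hourly_usd * HOURS_PER_MONTH


if __name__ == "__main__":
    # Hypothetical phase timings for a 40 GB image; measure these in staging.
    total = cold_start_seconds(provision_s=120, bootstrap_s=90, image_gb=40,
                               effective_gbps=1.0, model_load_s=180)
    print(f"estimated cold start: {total / 60:.1f} min")
    print(f"one warm g5.12xlarge: ${warm_pool_monthly_usd(1, 5.67):,.0f}/month")
```

Under these placeholder timings the image pull is the largest single phase, which is why pre-pulled images on a snapshot are the first optimization worth testing before paying for additional warm nodes.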

Fourth, observability. Bedrock exposes metrics via CloudWatch without configuration. With vLLM, you need to instrument the Prometheus `/metrics` endpoint, configure GPU utilization dashboards (DCGM Exporter), VRAM alerts, tokens/s throughput, and per-percentile latency. It is not difficult, but it is work that needs to be done and maintained.
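As a small illustration of the instrumentation work, the sketch below parses Prometheus text-format output and flags metrics that exceed a threshold. The metric names in the sample and the thresholds are assumptions and should be checked against the real `/metrics` output of the deployed vLLM version; the parser also assumes samples carry no trailing timestamp.

```python
from typing import Dict, List

# Illustrative scrape; names and values are assumptions, not captured output.
SAMPLE = """\
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="llama-3-70b-awq"} 12.0
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="llama-3-70b-awq"} 0.93
"""


def parse_metrics(text: str) -> Dict[str, float]:
    """Map each metric name to its highest sample value across label sets."""
    metrics: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, value = line.rsplit(" ", 1)
        name = series.split("{", 1)[0]
        metrics[name] = max(metrics.get(name, float("-inf")), float(value))
    return metrics


def breached(metrics: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Names of metrics whose value exceeds its threshold, sorted for stable output."""
    return sorted(n for n, limit in thresholds.items() if metrics.get(n, 0.0) > limit)


if __name__ == "__main__":
    limits = {"vllm:num_requests_waiting": 10, "vllm:gpu_cache_usage_perc": 0.90}
    print(breached(parse_metrics(SAMPLE), limits))
```

In production this logic belongs in Prometheus alerting rules rather than application code; the sketch only shows which signals are worth alerting on.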

The conclusion is that the real operational cost of a vLLM cluster in production — when you include engineering time, incidents, upgrades, and observability — is equivalent to 0.5–1 senior engineer FTE. For a small team, this can be more expensive than the API cost difference you are trying to save.

## Estimated Cost Comparison by Token Volume

| Monthly Volume | Bedrock Haiku On-Demand | Bedrock Provisioned Throughput | vLLM g5.12xl On-Demand | vLLM g5.12xl Reserved 1y | Recommendation |
| --- | --- | --- | --- | --- | --- |
| 500 M tokens | ~$275/month | ~$180/month (est.) | ~$4,082/month | ~$2,449/month | Bedrock |
| 2 B tokens | ~$1,100/month | ~$720/month (est.) | ~$4,082/month | ~$2,449/month | Bedrock PT |
| 5 B tokens | ~$2,750/month | ~$1,800/month (est.) | ~$4,082/month (1 inst.) | ~$2,449/month (1 inst.) | Hybrid / Evaluate |
| 10 B tokens | ~$5,500/month | ~$3,600/month (est.) | ~$8,164/month (2 inst.) | ~$4,898/month (2 inst.) | vLLM Reserved + FTE |

## Target Architecture: AI Platform with Hybrid Routing

Target architecture diagram after accepted decision: central LLM Gateway routing between Bedrock (interactive loads and frontier models) and vLLM/EKS (batch and open-weight models at high utilization). Includes observability layer, access control, and data residency.

### 👤 Clients

- Web App / API Clients (user)
- Batch Job Orchestrator (compute)

### 🔐 Security & Gateway

- API Gateway + WAF (security)
- LLM Gateway (LiteLLM / custom) (edge)
- Routing Logic model / load / cost (compute)

### ☁️ Bedrock (Managed)

- Bedrock On-Demand (ai)
- Bedrock Provisioned Throughput (ai)
- Claude 3 Sonnet / Haiku (ai)

### 🖥️ EKS + vLLM (Self-Hosted)

- EKS Service (vLLM endpoint) (compute)
- vLLM Pod Llama 3 70B AWQ (ai)
- Karpenter Node Provisioner (compute)
- g5.12xlarge 4× A10G GPU (compute)
- EBS Snapshot Model Image Cache (storage)

### 📊 Observability

- Prometheus + DCGM Exporter (data)
- Grafana GPU / Token Dashboards (frontend)
- CloudWatch Bedrock Metrics (data)

### 🗄️ Data & State

- ElastiCache Prompt Cache (Redis) (data)
- S3 Inference Logs (storage)

### Flows

- webapp -> apigw: HTTPS
- batchjob -> apigw: batch requests
- apigw -> llmgw: authn/authz
- llmgw -> cache: cache lookup
- llmgw -> router: route decision
- router -> bedrock_od: peaks / frontier
- router -> bedrock_pt: base load
- router -> eks_svc: batch / open-weight
- bedrock_od -> claude: API call
- bedrock_pt -> claude: provisioned
- eks_svc -> vllm_pod: k8s service
- vllm_pod -> gpu_node: CUDA / VRAM
- karpenter -> gpu_node: provision
- gpu_node -> ebs_snap: image cache
- vllm_pod -> prometheus: /metrics
- prometheus -> grafana
- bedrock_od -> cloudwatch: metrics
- llmgw -> s3_logs: audit log
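The routing edges above can be expressed as a small pure function, which keeps the policy testable independently of any gateway product. This is a sketch under assumptions: backend names mirror the flow identifiers, only open-weight batch traffic is sent to vLLM, and the health and headroom signals are supplied by the caller.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(str, Enum):
    BEDROCK_OD = "bedrock_od"  # peaks / frontier
    BEDROCK_PT = "bedrock_pt"  # predictable base load
    EKS_VLLM = "eks_svc"       # batch / open-weight


@dataclass(frozen=True)
class InferenceRequest:
    workload: str      # "interactive" or "batch"
    model_family: str  # "frontier" or "open-weight"


def route(req: InferenceRequest, *, vllm_healthy: bool, pt_has_headroom: bool) -> Backend:
    """Pick a backend; falls back to Bedrock whenever vLLM is unavailable."""
    wants_vllm = req.workload == "batch" and req.model_family == "open-weight"
    if wants_vllm and vllm_healthy:
        return Backend.EKS_VLLM
    if pt_has_headroom:
        return Backend.BEDROCK_PT  # use committed capacity first
    return Backend.BEDROCK_OD      # overflow and peaks


if __name__ == "__main__":
    batch = InferenceRequest(workload="batch", model_family="open-weight")
    print(route(batch, vllm_healthy=True, pt_has_headroom=True).value)
    print(route(batch, vllm_healthy=False, pt_has_headroom=False).value)
```

Because the fallback is part of the same function, a vLLM outage degrades cost rather than availability, which is the behavior the reliability assessment asks for.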

## Well-Architected Assessment

- **security**: Bedrock processes data within AWS without public internet exposure; VPC Endpoint recommended to eliminate egress. vLLM on EKS should operate in isolated namespace with restrictive Network Policy and IRSA for S3/secrets access. Inference logs in S3 with SSE-KMS for LGPD audit. LLM Gateway is the central authn/authz point — must validate JWT and apply per-tenant rate limiting.
- **reliability**: Bedrock has managed SLA; the reliability risk is in the LLM Gateway (potential SPOF — must be multi-AZ with ALB). For vLLM/EKS, the main risk is OOM on long-context peaks — mitigated with conservative `--max-model-len` and HPA based on tokens/s. Automatic gateway fallback to Bedrock on vLLM endpoint failure is essential.
- **performance**: vLLM with PagedAttention and continuous batching offers superior throughput for batch; Bedrock is superior for interactive P50 latency without warm-up. Prompt cache in Redis (ElastiCache) can reduce calls by 20–40% for repetitive prompts (fixed system prompts, templates). Tensor parallelism in vLLM shards a model across multiple GPUs and is sensitive to interconnect bandwidth; for models larger than 70B, consider NVLink-equipped instances such as p4d/p5.
- **cost**: Break-even between Bedrock and vLLM/EKS is at 3–5 B tokens/month for Haiku-equivalent with >70% utilization. Bedrock Provisioned Throughput reduces cost ~30–40% for predictable base load. Spot Instances for vLLM batch pods can reduce GPU cost by 60–70% with proper state checkpointing. Hidden MLOps cost (0.5–1 FTE) must be included in TCO.
- **sustainability**: Idle dedicated GPUs have a fixed carbon footprint regardless of utilization — >70% utilization is imperative for both cost and sustainability. Bedrock, being multi-tenant, has better energy efficiency per token at low volumes. Regions with renewable energy (us-west-2) should be considered for batch workloads without data residency requirements.
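The prompt cache mentioned under performance depends on a deterministic key. The sketch below derives one from the tenant, model, prompts, and sampling parameters, with an in-memory dictionary standing in for Redis. The key layout, the `llmcache:` prefix, and the decision to scope entries per tenant are assumptions chosen for a multi-tenant platform.

```python
import hashlib
import json
from typing import Any, Dict, Optional


def prompt_cache_key(tenant_id: str, model: str, system_prompt: str,
                     user_prompt: str, params: Dict[str, Any]) -> str:
    """Stable SHA-256 key; sort_keys makes it independent of dict ordering."""
    payload = json.dumps(
        {"tenant": tenant_id, "model": model, "system": system_prompt,
         "user": user_prompt, "params": params},
        sort_keys=True, separators=(",", ":"),
    )
    return "llmcache:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


class InMemoryCache:
    """Stand-in for Redis in this sketch; a real deployment would also set a TTL."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._store.get(key)

    def set(self, key: str, value: str) -> None:
        self._store[key] = value


if __name__ == "__main__":
    cache = InMemoryCache()
    key = prompt_cache_key("tenant-a", "llama-3-70b-awq", "You are a summarizer.",
                           "Summarize clause 4.", {"temperature": 0, "max_tokens": 256})
    if cache.get(key) is None:
        cache.set(key, "cached completion text")  # placeholder for a model response
    print(key[:24], cache.get(key))
```

Including the tenant in the key prevents one client's completion from being served to another, which matters for the LGPD constraints in this ADR even though it lowers the hit rate.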

> **My Senior Perspective:** I've seen this discussion happen dozens of times in companies of different sizes, and the error pattern is always the same: the technical team builds a spreadsheet of instance cost vs. API cost, the number favors self-hosting at projected volumes, and the decision is made without considering the hidden denominator, namely the cost of operating GPUs in production with the quality a real product demands.
>
> My position is clear: **unless you already have a team with proven operational experience in GPU clusters, start with Bedrock and invest the saved time in building the LLM Gateway correctly**. The Gateway is the most valuable asset in this architecture: it abstracts the provider, enables gradual migration, centralizes observability and cost control, and is what will allow you to move workloads to vLLM when the right moment comes without rewriting the application.
>
> A specific mistake I see frequently: teams that choose self-hosting too early and size for the projected demand peak. Result: GPUs running at 20–30% utilization, real cost per token 3–4× higher than the Bedrock they were trying to avoid, and two senior engineers spending 30% of their time on infra maintenance instead of product.
>
> The other side also exists: I've seen teams stay on Bedrock out of inertia well past the break-even point, paying 5–10× more than necessary. The discipline of reviewing the decision with objective criteria, like the ones I defined in this ADR, is what prevents both mistakes.
>
> On compliance and LGPD: Bedrock in sa-east-1 solves data residency for most cases. If the requirement is that **no PII data leaves the VPC** (not just the region), then vLLM/EKS is mandatory regardless of volume, and that criterion must be explicit in the ADR, not implicit.

## Verdict

The choice between Bedrock and vLLM/EKS is not primarily technical: it is a function of token volume, team operational maturity, and load profile. For volumes below 3 B tokens/month or teams without production GPU experience, Bedrock is clearly superior both economically and operationally. The real break-even with vLLM/EKS, including MLOps FTE cost, is in the range of 5–8 B tokens/month with consistent utilization above 70%, and should not be calculated without these hidden costs.

The recommended architecture is not a binary choice: it is an LLM Gateway as a mandatory abstraction layer from day zero, Bedrock as the primary layer in the initial phase, and vLLM/EKS built and validated in parallel in staging. Migration happens when objective, measurable criteria are met — not when someone feels 'it's time'.

The most important investment in this decision is not in the choice of GPU or API: it is in the quality of the LLM Gateway. A well-built gateway — with model-based routing, per-tenant rate limiting, prompt caching, inference logging, and automatic fallback — is what transforms an infrastructure decision into a product competitive advantage. Without it, you are just trading one vendor lock-in for another.

## References

- [Amazon Bedrock — Pricing](https://aws.amazon.com/bedrock/pricing/)
- [vLLM — Documentation](https://docs.vllm.ai/)
- [Amazon EKS — User Guide](https://docs.aws.amazon.com/eks/latest/userguide/)

