Playbook: Where to Run AI on AWS — Lambda vs Fargate vs ECS/EKS (GPU)
Listen to study
generated on playGenerated only on first play
Powered by Amazon Polly + OmniVoice
Choosing the wrong compute for AI workloads on AWS is the most expensive mistake teams make when scaling from prototype to production. This playbook maps Lambda, Fargate, ECS/EKS+GPU, and Bedrock against the axes that actually matter — duration, GPU, traffic pattern, and cost per token — and delivers an actionable decision tree for every scenario.
Every team starts with Lambda because it's serverless, then discovers a 7B model doesn't fit in 10 GB of RAM with a 15-minute timeout. The problem isn't the tool — it's using the right tool for the right layer. This playbook solves exactly that: which compute for which role in your AI system.
What you'll be able to decide after reading this
Scale References and Limits — AWS (2024/2025)
- Lambda max timeout
- 15 minutes
- Lambda max RAM
- 10 GB (no GPU)
- AWS GPU instance families
- p3, p4d, p5, g4dn, g5, g6, inf2, trn1
- Bedrock — base model (Claude 3 Sonnet estimate)
- $3/M tokens input, $15/M tokens output (verify current pricing)
- Fargate — max vCPU per task
- 16 vCPU / 120 GB RAM (no native GPU)
- EKS + g5.12xlarge (4x A10G 24GB)
- ~$16/h on-demand; ~$5-6/h spot (estimate)
- Lambda cold start (container image)
- 1–5 s typical; can reach 10 s+ for large images
The mental model that unlocks everything: separate the agent from the model
The most common confusion I see in AI teams on AWS is treating the system as a monolithic block and trying to fit everything into the same compute. A modern AI system has at least two layers with completely different characteristics:
Orchestration / agent layer: receives the request, builds the prompt, calls tools (search, database, external APIs), manages conversation state, handles retries, logs. This layer is stateless between requests, executes in seconds to a few minutes, and the bottleneck is I/O — not CPU or GPU. Lambda was built for this. An agent waiting for a tool call response is exactly the async wait pattern Lambda handles well, especially with async invocation and Step Functions for longer flows.
Inference / model layer: loads weights (GBs to tens of GBs), runs forward pass on GPU or vectorized CPU, returns tokens. This layer is stateful in memory (weights stay loaded), has latency from hundreds of ms to minutes depending on the model, and the bottleneck is intensive compute. Lambda has no GPU, has a 10 GB RAM limit, and kills the process after 15 minutes — three strikes for any serious model.
When you mix these two layers in the same service, you pay the price of the most expensive layer on every request, including those that don't need it. Separate them. Lambda orchestrates. Bedrock, Fargate, or EKS+GPU serve the model. This separation is the foundation of any AI architecture that scales without invoice surprises.
The hidden axis: total cost per token, not instance price
When someone shows me a spreadsheet comparing Bedrock with self-hosted and concludes self-hosted is 10x cheaper, my first question is: did you include engineering cost?
Bedrock charges per token. You don't manage clusters, configure GPU autoscaling, deal with CUDA drivers, monitor VRAM utilization, or handle model version upgrades. For a team of 3 engineers building a product, the cost of 2 senior engineer weeks configuring EKS+GPU in production can exceed months of token price difference.
The math you need to do:
Total Bedrock cost = tokens × price/token
Total self-hosted cost =
(GPU hours × price/hour)
+ (engineering hours setup × cost/hour)
+ (engineering hours maintenance/month × cost/hour × months)
+ (cost of incident when cluster goes down at 3am)
The real break-even depends on your traffic, chosen model, and your team's MLOps maturity. For most early- to mid-stage products, Bedrock wins on total cost even though it's more expensive per token. The inflection point typically appears when you have high and predictable enough traffic to justify Reserved Instances or Savings Plans on GPU, and a dedicated platform team to absorb the operational overhead.
A conservative estimate: if you're generating fewer than 500 million tokens per month with a model equivalent to Claude Haiku or Llama 3 8B, the operational overhead of self-hosted likely negates the token savings. Above 2–5 billion tokens/month with stable traffic, the conversation changes. Do the math with your real numbers.
Decision Matrix: Lambda vs Fargate vs EKS+GPU vs Bedrock
AWS Lambda
- Zero infra to manage; scales to zero automatically
- Ideal for agent orchestration, glue code, calling Bedrock/external APIs
- Zero cost when idle; great for unpredictable spikes
- Native integration with Step Functions, SQS, EventBridge for async flows
- No GPU; max RAM 10 GB; max timeout 15 minutes
- Cold start can be a problem for P99 latency with large images
- Cannot serve heavy models; cannot do local LLM inference
- Per-invocation cost can surprise at very high and constant traffic
Use for orchestration, agent, glue, webhooks, calling Bedrock. Never for hosting models.
AWS Fargate (ECS/EKS)
- Container serverless: no EC2 node management
- No Lambda cold start; container stays warm while task is running
- Good for CPU-bound inference API (small models, embeddings, reranking)
- Task isolation per request; good security surface
- No native GPU support (Fargate has no GPU instances)
- More expensive than Lambda for very spiky traffic (pays per active task)
- Slower scale-to-zero than Lambda; scale-out latency in minutes
- For GPU you need ECS/EKS with EC2, not pure Fargate
Use for RAG workers, embedding APIs, CPU reranking services, AI microservices without GPU.
ECS / EKS + GPU (EC2)
- Access to full AWS GPU family: g4dn, g5, p4d, p5, inf2, trn1
- Lowest cost per token at high volume with Reserved Instances or Spot
- Full control: model, version, quantization, batching strategy
- Support for frameworks: vLLM, TGI, TensorRT-LLM, Triton
- High operational overhead: cluster, CUDA drivers, GPU autoscaling, VRAM monitoring
- High fixed cost even at low traffic (GPU instance running)
- Not suitable for prototype or unpredictable traffic
- Requires MLOps maturity: model CI/CD, rollback, A/B serving
Use when volume justifies operations: >500M tokens/month, stable traffic, platform team available.
Amazon Bedrock
- Zero ops: no cluster, no driver, no autoscaling to manage
- Access to frontier models (Claude, Titan, Llama, Mistral, Stable Diffusion)
- Automatic scaling; no model cold start; AWS SLA
- Low total cost for small and medium volumes when ops is included
- Higher cost per token than self-hosted at very large volume
- No model version control (AWS can deprecate); no arbitrary fine-tuning
- Data leaves your VPC to Bedrock endpoint (consider PrivateLink)
- Rate limits per account/region; can be bottleneck at high burst
Default for most products until volume justifies self-hosted. Combine with Lambda for orchestration.
Comparison by Practical AI Workload Axes
| Axis | Lambda | Fargate | EKS + GPU | Bedrock | |
|---|---|---|---|---|---|
| Ideal traffic pattern | Spiky / unpredictable | Constant / predictable | High volume / stable | Any (AWS-managed) | — |
| Max execution duration | 15 min (hard limit) | Unlimited (active task) | Unlimited (active pod) | Depends on model (streaming available) | — |
| GPU support | ❌ No | ❌ No (pure Fargate) | ✅ Yes (g4dn, g5, p4d, p5, inf2, trn1) | ✅ Yes (managed) | — |
| Cost at low traffic | ⭐ Best (scales to zero) | Medium (minimum task running) | High (idle GPU = burning $) | ⭐ Good (pay only what you use) | — |
| Cost at high volume | Medium-high (per invocation) | Medium (per vCPU/GB-hour) | ⭐ Best (RI/Spot + batching) | High (token price doesn't drop with volume) | — |
| Operational overhead | ⭐ Minimal | Low-medium | High (cluster, CUDA, autoscaling, VRAM) | ⭐ Zero (AWS manages) | — |
| Cold start / warmup latency | 1–10 s (container image) | 30 s – 2 min (new task) | Minutes (new GPU node) | ~100–500 ms (model already loaded) | — |
| Primary AI use case | Agent, orchestration, calling Bedrock | Embedding API, reranking, RAG worker | Serving open-weights LLM, batch inference | Frontier model inference, prototype, production without ops | — |
When each layer makes sense: the real patterns
Pattern 1 — Agent with Bedrock (the most common and correct for 80% of cases)
Lambda receives the user request, builds context, calls bedrock:InvokeModel or bedrock-runtime:InvokeModelWithResponseStream, processes the response, calls tools if needed (other Lambdas, APIs, DynamoDB), and returns. For flows exceeding 15 minutes or with many steps, Step Functions Express Workflows are the natural complement. Cost: Lambda + Bedrock per token. Ops: zero. Time to market: days.
Pattern 2 — RAG Service with Fargate
A RAG service that receives a query, does embedding (small model like all-MiniLM or bge-small), searches OpenSearch/pgvector, does reranking, and returns context. This service runs well on Fargate with 4-8 vCPU and embedding models on CPU. An 80-400 MB embedding model fits comfortably in memory and doesn't need GPU for acceptable latency at moderate throughput. Cost: Fargate per vCPU/hour. Ops: low (ECS service with ALB and CPU-based autoscaling).
Pattern 3 — Self-hosted LLM on EKS+GPU
You have an open-weights model (Llama 3 70B, Mistral, Qwen) and enough traffic to justify the operation. You run vLLM or TGI in pods with nvidia.com/gpu: 1 on EKS, with Karpenter to provision GPU nodes on demand and scale to zero when traffic drops (caution: GPU scale-to-zero has 3-8 minute latency for a new node). You use Spot Instances for batch and On-Demand for real-time. Autoscaling is by custom metric (tokens/second, queue depth) not CPU. Cost: GPU hour + ops. Ops: high. Requires: CUDA drivers in AMI, node selectors/taints in Kubernetes, VRAM monitoring, graceful shutdown to avoid losing in-flight requests.
Pattern 4 — Async Batch Inference
Offline document processing, bulk embedding generation, dataset evaluation. Use SQS + Lambda to orchestrate, and ECS Tasks (not Fargate, use EC2 with GPU) to process each batch. Or use AWS Batch with GPU instances for long-running jobs. Bedrock also has a Batch Inference API for this case without managing anything.
How to Make the Decision: 7 Questions in Order
- 1
1. Are you serving a model or orchestrating calls?
If it's orchestration (agent, glue, router, webhook): Lambda or Step Functions. Stop here. If it's serving a model: continue.
- 2
2. Does the model need GPU?
Small embedding models (< 500M params), rerankers, and light classifiers run fine on CPU. Any generative LLM with >1B params needs GPU for acceptable production latency. If CPU: Fargate or Lambda (if it fits in 10 GB). If GPU: EKS+GPU or Bedrock.
- 3
3. Is the model proprietary/frontier or open-weights?
Claude, GPT-4, Titan, managed Stable Diffusion: use Bedrock (or provider's direct API). Llama, Mistral, Qwen, Falcon, your fine-tuned models: self-hosted on EKS+GPU.
- 4
4. What is your monthly token volume?
< 100M tokens/month: Bedrock almost always wins on total cost. 100M–1B: do the math (include ops). > 1B with stable traffic: self-hosted starts making financial sense. These are guideline thresholds — use your real numbers.
- 5
5. What is the traffic pattern?
Spiky/unpredictable (e.g., B2C product with viral spikes): Bedrock or Lambda+Bedrock. Constant and predictable (e.g., document processing pipeline): Fargate or EKS+GPU with Reserved Instances. Batch/offline: AWS Batch + GPU or Bedrock Batch API.
- 6
6. Does your team have operational capacity for GPU?
EKS+GPU requires: someone who understands Kubernetes, CUDA, custom metric autoscaling, VRAM monitoring, and is available for incidents at 3am. If you don't have that profile on the team today, don't start with EKS+GPU. Bedrock or SageMaker Endpoints are the way.
- 7
7. Does inference take more than 15 minutes?
Video generation, long document processing, embedding batch: don't use Lambda directly. Use SQS + Lambda to enqueue + ECS Task or AWS Batch to process. Or Step Functions with callback pattern for long flows with Lambda.
AI Compute Decision Tree on AWS
Decision flow from AI request to the correct compute. Read top to bottom following conditions.
- AI Workload · Request
- Servindo modelo · ou orquestrando?
- Precisa de GPU?
- Open-weights · ou frontier?
- Volume > 500M · tokens/mês?
- Tráfego estável · e time de ops?
- Duração · > 15 min?
- Lambda · + Step Functions
- Fargate ECS · CPU inference
- Amazon Bedrock · Managed inference
- EKS + GPU · (g5/p4d/inf2)
- AWS Batch · + GPU / SQS
- Bedrock API · Claude/Titan/Llama
- vLLM / TGI · open-weights
- Embedding Model · CPU (MiniLM/BGE)
Anti-pattern: Dumping everything into Lambda 'because it's serverless'
This is the most frequent mistake I see in teams starting with AI on AWS. The reasoning is: 'Lambda is serverless, scales automatically, I don't need to manage a server — I'll put the model there too.' The result is a Lambda with an 8 GB image trying to load a 4 GB model into RAM, 30-second cold starts, timeouts blowing up on long requests, and absurd per-invocation cost because the function sits at the memory limit. What happens in practice: - A 7B model quantized to INT4 needs ~4 GB of VRAM. Lambda has no VRAM. On CPU, it needs ~4-8 GB of RAM and takes 30-120 seconds to generate 200 tokens. That's not production. - Lambda cold start with an 8 GB image can take 10-15 seconds just to initialize the container, before loading the model. - The 15-minute timeout seems like a lot until you have a user with a long prompt and 32K token context. - You pay per GB-second. A 10 GB Lambda running for 5 minutes costs ~$0.083 per invocation — compare that to a Bedrock Claude Haiku call that costs fractions of a cent. The rule: Lambda is for calling the model, not for being the model. Use Lambda to orchestrate, do glue, call Bedrock or your inference endpoint. The model lives somewhere else.
Rule of Thumb
Lambda orchestrates. Bedrock infers. EKS+GPU scales at volume. Fargate fills the middle. Or more directly: if you're writing code that calls a model, use Lambda. If you're writing code that IS the model, use something else. And before building a GPU cluster, do the total cost math — token price × volume versus (GPU hour × hours) + (engineer × hours). In most cases up to 500M tokens/month, Bedrock wins when you include ops in the equation.
After 16 years building distributed systems — including financial platforms where every cent of operational cost is audited — what strikes me about AI projects is how experienced teams repeat the 2012 microservices mistakes: they choose the most 'powerful' technology before understanding the access pattern. My default approach today: start with Lambda + Bedrock for anything new. Zero ops, time to market in days, cost proportional to real usage. When the product finds traction and token volume starts showing up significantly on the bill, then I do the break-even analysis with the system's real numbers — not spreadsheet estimates. For most products I've followed, that break-even moment comes after significant scale, and when it arrives, the team already has the operational maturity to handle EKS+GPU responsibly. Trying to skip that step 'to save money' usually results in spending more on engineering than would have been spent on tokens. The only case where I jump straight to self-hosted is when there's a regulatory requirement that data cannot leave the controlled environment (e.g., health data, financial with sensitive PII) and PrivateLink + Bedrock isn't sufficient for compliance. In that case, EKS+GPU with an open-weights model inside the VPC is the answer — but the decision is compliance-driven, not cost-driven. One thing I rarely see in architecture documents: the engineering time cost of configuring and maintaining GPU in production is real and recurring. Put that in the spreadsheet before presenting the decision.
Verdict
There is no correct AI compute — there is correct compute for the role, the volume, and your team's operational maturity today. Lambda is the best agent orchestrator on AWS; it's a terrible model server. Bedrock is the rational choice for most products until significant scale, because the real cost includes ops, not just tokens. EKS+GPU is powerful and expensive to operate — use it when volume justifies it and the team is ready, not as a first step. Fargate fills the middle space for CPU-bound AI services that need more than Lambda offers without GPU complexity. The right decision is the one you can operate, scale, and debug at 3am — not the one that looks most impressive in the diagram.
Post-mortems, ADRs and architecture deep dives in your inbox — the way an architect reads them.
No spam · unsubscribe anytime
Ask Fernando about this
Get a focused answer about this study from my AI assistant, grounded in my work.
Join the conversation
Sign in to comment
Verify your email to join in — you'll also get the newsletter. No password.