Who is Fernando F. Azevedo?

Fernando F. Azevedo is a Senior Solutions Architect at Banco Itaú with 16+ years of experience across AWS, event-driven architecture, DevSecOps, Data Mesh, AI and financial systems.

What technical topics does Fernando work with?

Fernando works with AWS, Kubernetes, Kafka, Data Mesh, Amazon Bedrock, RAG, DevSecOps, observability, financial systems and architecture communication using C4, ADRs and trade-off analysis.

Is Fernando available for professional conversations?

Fernando is currently building at Banco Itaú and is open to thoughtful conversations about architecture, cloud, AI, engineering leadership, community, podcasts and technical collaboration.

PlaybookAWS / Compute

Playbook: Where to Run AI on AWS — Lambda vs Fargate vs ECS/EKS (GPU)

Aug 20, 2025 9 min AI-assisted

Listen to study

generated on play

Generated only on first play

On demand

0:000:00

Speed

The MP3 is saved to S3 after the first play.

Choosing the wrong compute for AI workloads on AWS is the most expensive mistake teams make when scaling from prototype to production. This playbook maps Lambda, Fargate, ECS/EKS+GPU, and Bedrock against the axes that actually matter — duration, GPU, traffic pattern, and cost per token — and delivers an actionable decision tree for every scenario.

Every team starts with Lambda because it's serverless, then discovers a 7B model doesn't fit in 10 GB of RAM with a 15-minute timeout. The problem isn't the tool — it's using the right tool for the right layer. This playbook solves exactly that: which compute for which role in your AI system.

What you'll be able to decide after reading this

Lambda is for orchestration, glue, and calling Bedrock — not for hosting heavy models.

Fargate serves CPU-bound inference APIs with predictable traffic; it doesn't solve cheap dedicated GPU at scale.

ECS/EKS+GPU is the path for serving open-weights at volume — but you pay in ops, not in tokens.

Bedrock charges per token with zero ops; there's a volume where self-hosted becomes cheaper — do the math with YOUR traffic before deciding.

The hidden axis is total cost per token (infra + ops + engineering latency), not just instance price.

LLM agents live well in Lambda; the model the agent calls does not.

Scale References and Limits — AWS (2024/2025)

Lambda max timeout: 15 minutes
Lambda max RAM: 10 GB (no GPU)
AWS GPU instance families: p3, p4d, p5, g4dn, g5, g6, inf2, trn1
Bedrock — base model (Claude 3 Sonnet estimate): $3/M tokens input, $15/M tokens output (verify current pricing)
Fargate — max vCPU per task: 16 vCPU / 120 GB RAM (no native GPU)
EKS + g5.12xlarge (4x A10G 24GB): ~$16/h on-demand; ~$5-6/h spot (estimate)
Lambda cold start (container image): 1–5 s typical; can reach 10 s+ for large images

The mental model that unlocks everything: separate the agent from the model

The most common confusion I see in AI teams on AWS is treating the system as a monolithic block and trying to fit everything into the same compute. A modern AI system has at least two layers with completely different characteristics:

Orchestration / agent layer: receives the request, builds the prompt, calls tools (search, database, external APIs), manages conversation state, handles retries, logs. This layer is stateless between requests, executes in seconds to a few minutes, and the bottleneck is I/O — not CPU or GPU. Lambda was built for this. An agent waiting for a tool call response is exactly the async wait pattern Lambda handles well, especially with async invocation and Step Functions for longer flows.

Inference / model layer: loads weights (GBs to tens of GBs), runs forward pass on GPU or vectorized CPU, returns tokens. This layer is stateful in memory (weights stay loaded), has latency from hundreds of ms to minutes depending on the model, and the bottleneck is intensive compute. Lambda has no GPU, has a 10 GB RAM limit, and kills the process after 15 minutes — three strikes for any serious model.

When you mix these two layers in the same service, you pay the price of the most expensive layer on every request, including those that don't need it. Separate them. Lambda orchestrates. Bedrock, Fargate, or EKS+GPU serve the model. This separation is the foundation of any AI architecture that scales without invoice surprises.

The hidden axis: total cost per token, not instance price

When someone shows me a spreadsheet comparing Bedrock with self-hosted and concludes self-hosted is 10x cheaper, my first question is: did you include engineering cost?

Bedrock charges per token. You don't manage clusters, configure GPU autoscaling, deal with CUDA drivers, monitor VRAM utilization, or handle model version upgrades. For a team of 3 engineers building a product, the cost of 2 senior engineer weeks configuring EKS+GPU in production can exceed months of token price difference.

The math you need to do:

Total Bedrock cost = tokens × price/token

Total self-hosted cost = 
  (GPU hours × price/hour)
  + (engineering hours setup × cost/hour)
  + (engineering hours maintenance/month × cost/hour × months)
  + (cost of incident when cluster goes down at 3am)

The real break-even depends on your traffic, chosen model, and your team's MLOps maturity. For most early- to mid-stage products, Bedrock wins on total cost even though it's more expensive per token. The inflection point typically appears when you have high and predictable enough traffic to justify Reserved Instances or Savings Plans on GPU, and a dedicated platform team to absorb the operational overhead.

A conservative estimate: if you're generating fewer than 500 million tokens per month with a model equivalent to Claude Haiku or Llama 3 8B, the operational overhead of self-hosted likely negates the token savings. Above 2–5 billion tokens/month with stable traffic, the conversation changes. Do the math with your real numbers.

Decision Matrix: Lambda vs Fargate vs EKS+GPU vs Bedrock

AWS Lambda

Pros

Zero infra to manage; scales to zero automatically
Ideal for agent orchestration, glue code, calling Bedrock/external APIs
Zero cost when idle; great for unpredictable spikes
Native integration with Step Functions, SQS, EventBridge for async flows

Cons

No GPU; max RAM 10 GB; max timeout 15 minutes
Cold start can be a problem for P99 latency with large images
Cannot serve heavy models; cannot do local LLM inference
Per-invocation cost can surprise at very high and constant traffic

Use for orchestration, agent, glue, webhooks, calling Bedrock. Never for hosting models.

AWS Fargate (ECS/EKS)

Pros

Container serverless: no EC2 node management
No Lambda cold start; container stays warm while task is running
Good for CPU-bound inference API (small models, embeddings, reranking)
Task isolation per request; good security surface

Cons

No native GPU support (Fargate has no GPU instances)
More expensive than Lambda for very spiky traffic (pays per active task)
Slower scale-to-zero than Lambda; scale-out latency in minutes
For GPU you need ECS/EKS with EC2, not pure Fargate

Use for RAG workers, embedding APIs, CPU reranking services, AI microservices without GPU.

ECS / EKS + GPU (EC2)

Pros

Access to full AWS GPU family: g4dn, g5, p4d, p5, inf2, trn1
Lowest cost per token at high volume with Reserved Instances or Spot
Full control: model, version, quantization, batching strategy
Support for frameworks: vLLM, TGI, TensorRT-LLM, Triton

Cons

High operational overhead: cluster, CUDA drivers, GPU autoscaling, VRAM monitoring
High fixed cost even at low traffic (GPU instance running)
Not suitable for prototype or unpredictable traffic
Requires MLOps maturity: model CI/CD, rollback, A/B serving

Use when volume justifies operations: >500M tokens/month, stable traffic, platform team available.

Amazon Bedrock

Pros

Zero ops: no cluster, no driver, no autoscaling to manage
Access to frontier models (Claude, Titan, Llama, Mistral, Stable Diffusion)
Automatic scaling; no model cold start; AWS SLA
Low total cost for small and medium volumes when ops is included

Cons

Higher cost per token than self-hosted at very large volume
No model version control (AWS can deprecate); no arbitrary fine-tuning
Data leaves your VPC to Bedrock endpoint (consider PrivateLink)
Rate limits per account/region; can be bottleneck at high burst

Default for most products until volume justifies self-hosted. Combine with Lambda for orchestration.

Comparison by Practical AI Workload Axes

	Axis	Lambda	Fargate	EKS + GPU	Bedrock
Ideal traffic pattern	Spiky / unpredictable	Constant / predictable	High volume / stable	Any (AWS-managed)	—
Max execution duration	15 min (hard limit)	Unlimited (active task)	Unlimited (active pod)	Depends on model (streaming available)	—
GPU support	❌ No	❌ No (pure Fargate)	✅ Yes (g4dn, g5, p4d, p5, inf2, trn1)	✅ Yes (managed)	—
Cost at low traffic	⭐ Best (scales to zero)	Medium (minimum task running)	High (idle GPU = burning $)	⭐ Good (pay only what you use)	—
Cost at high volume	Medium-high (per invocation)	Medium (per vCPU/GB-hour)	⭐ Best (RI/Spot + batching)	High (token price doesn't drop with volume)	—
Operational overhead	⭐ Minimal	Low-medium	High (cluster, CUDA, autoscaling, VRAM)	⭐ Zero (AWS manages)	—
Cold start / warmup latency	1–10 s (container image)	30 s – 2 min (new task)	Minutes (new GPU node)	~100–500 ms (model already loaded)	—
Primary AI use case	Agent, orchestration, calling Bedrock	Embedding API, reranking, RAG worker	Serving open-weights LLM, batch inference	Frontier model inference, prototype, production without ops	—

When each layer makes sense: the real patterns

Pattern 1 — Agent with Bedrock (the most common and correct for 80% of cases)

Lambda receives the user request, builds context, calls bedrock:InvokeModel or bedrock-runtime:InvokeModelWithResponseStream, processes the response, calls tools if needed (other Lambdas, APIs, DynamoDB), and returns. For flows exceeding 15 minutes or with many steps, Step Functions Express Workflows are the natural complement. Cost: Lambda + Bedrock per token. Ops: zero. Time to market: days.

Pattern 2 — RAG Service with Fargate

A RAG service that receives a query, does embedding (small model like all-MiniLM or bge-small), searches OpenSearch/pgvector, does reranking, and returns context. This service runs well on Fargate with 4-8 vCPU and embedding models on CPU. An 80-400 MB embedding model fits comfortably in memory and doesn't need GPU for acceptable latency at moderate throughput. Cost: Fargate per vCPU/hour. Ops: low (ECS service with ALB and CPU-based autoscaling).

Pattern 3 — Self-hosted LLM on EKS+GPU

You have an open-weights model (Llama 3 70B, Mistral, Qwen) and enough traffic to justify the operation. You run vLLM or TGI in pods with nvidia.com/gpu: 1 on EKS, with Karpenter to provision GPU nodes on demand and scale to zero when traffic drops (caution: GPU scale-to-zero has 3-8 minute latency for a new node). You use Spot Instances for batch and On-Demand for real-time. Autoscaling is by custom metric (tokens/second, queue depth) not CPU. Cost: GPU hour + ops. Ops: high. Requires: CUDA drivers in AMI, node selectors/taints in Kubernetes, VRAM monitoring, graceful shutdown to avoid losing in-flight requests.

Pattern 4 — Async Batch Inference

Offline document processing, bulk embedding generation, dataset evaluation. Use SQS + Lambda to orchestrate, and ECS Tasks (not Fargate, use EC2 with GPU) to process each batch. Or use AWS Batch with GPU instances for long-running jobs. Bedrock also has a Batch Inference API for this case without managing anything.

How to Make the Decision: 7 Questions in Order

1
1. Are you serving a model or orchestrating calls?
If it's orchestration (agent, glue, router, webhook): Lambda or Step Functions. Stop here. If it's serving a model: continue.
2
2. Does the model need GPU?
Small embedding models (< 500M params), rerankers, and light classifiers run fine on CPU. Any generative LLM with >1B params needs GPU for acceptable production latency. If CPU: Fargate or Lambda (if it fits in 10 GB). If GPU: EKS+GPU or Bedrock.
3
3. Is the model proprietary/frontier or open-weights?
Claude, GPT-4, Titan, managed Stable Diffusion: use Bedrock (or provider's direct API). Llama, Mistral, Qwen, Falcon, your fine-tuned models: self-hosted on EKS+GPU.
4
4. What is your monthly token volume?
< 100M tokens/month: Bedrock almost always wins on total cost. 100M–1B: do the math (include ops). > 1B with stable traffic: self-hosted starts making financial sense. These are guideline thresholds — use your real numbers.
5
5. What is the traffic pattern?
Spiky/unpredictable (e.g., B2C product with viral spikes): Bedrock or Lambda+Bedrock. Constant and predictable (e.g., document processing pipeline): Fargate or EKS+GPU with Reserved Instances. Batch/offline: AWS Batch + GPU or Bedrock Batch API.
6
6. Does your team have operational capacity for GPU?
EKS+GPU requires: someone who understands Kubernetes, CUDA, custom metric autoscaling, VRAM monitoring, and is available for incidents at 3am. If you don't have that profile on the team today, don't start with EKS+GPU. Bedrock or SageMaker Endpoints are the way.
7
7. Does inference take more than 15 minutes?
Video generation, long document processing, embedding batch: don't use Lambda directly. Use SQS + Lambda to enqueue + ECS Task or AWS Batch to process. Or Step Functions with callback pattern for long flows with Lambda.

AI Compute Decision Tree on AWS

Decision flow from AI request to the correct compute. Read top to bottom following conditions.

🚦 Entry

AI Workload · Request

🔀 Decision Layer

Servindo modelo · ou orquestrando?
Precisa de GPU?
Open-weights · ou frontier?
Volume > 500M · tokens/mês?
Tráfego estável · e time de ops?
Duração · > 15 min?

✅ Compute Targets

Lambda · + Step Functions
Fargate ECS · CPU inference
Amazon Bedrock · Managed inference
EKS + GPU · (g5/p4d/inf2)
AWS Batch · + GPU / SQS

📦 Model Layer

Bedrock API · Claude/Titan/Llama
vLLM / TGI · open-weights
Embedding Model · CPU (MiniLM/BGE)

Anti-pattern: Dumping everything into Lambda 'because it's serverless'

This is the most frequent mistake I see in teams starting with AI on AWS. The reasoning is: 'Lambda is serverless, scales automatically, I don't need to manage a server — I'll put the model there too.' The result is a Lambda with an 8 GB image trying to load a 4 GB model into RAM, 30-second cold starts, timeouts blowing up on long requests, and absurd per-invocation cost because the function sits at the memory limit. What happens in practice: - A 7B model quantized to INT4 needs ~4 GB of VRAM. Lambda has no VRAM. On CPU, it needs ~4-8 GB of RAM and takes 30-120 seconds to generate 200 tokens. That's not production. - Lambda cold start with an 8 GB image can take 10-15 seconds just to initialize the container, before loading the model. - The 15-minute timeout seems like a lot until you have a user with a long prompt and 32K token context. - You pay per GB-second. A 10 GB Lambda running for 5 minutes costs ~$0.083 per invocation — compare that to a Bedrock Claude Haiku call that costs fractions of a cent. The rule: Lambda is for calling the model, not for being the model. Use Lambda to orchestrate, do glue, call Bedrock or your inference endpoint. The model lives somewhere else.

Rule of Thumb

Lambda orchestrates. Bedrock infers. EKS+GPU scales at volume. Fargate fills the middle. Or more directly: if you're writing code that calls a model, use Lambda. If you're writing code that IS the model, use something else. And before building a GPU cluster, do the total cost math — token price × volume versus (GPU hour × hours) + (engineer × hours). In most cases up to 500M tokens/month, Bedrock wins when you include ops in the equation.

My Senior Take

Senior Solutions Architect

After 16 years building distributed systems — including financial platforms where every cent of operational cost is audited — what strikes me about AI projects is how experienced teams repeat the 2012 microservices mistakes: they choose the most 'powerful' technology before understanding the access pattern. My default approach today: start with Lambda + Bedrock for anything new. Zero ops, time to market in days, cost proportional to real usage. When the product finds traction and token volume starts showing up significantly on the bill, then I do the break-even analysis with the system's real numbers — not spreadsheet estimates. For most products I've followed, that break-even moment comes after significant scale, and when it arrives, the team already has the operational maturity to handle EKS+GPU responsibly. Trying to skip that step 'to save money' usually results in spending more on engineering than would have been spent on tokens. The only case where I jump straight to self-hosted is when there's a regulatory requirement that data cannot leave the controlled environment (e.g., health data, financial with sensitive PII) and PrivateLink + Bedrock isn't sufficient for compliance. In that case, EKS+GPU with an open-weights model inside the VPC is the answer — but the decision is compliance-driven, not cost-driven. One thing I rarely see in architecture documents: the engineering time cost of configuring and maintaining GPU in production is real and recurring. Put that in the spreadsheet before presenting the decision.

Verdict

There is no correct AI compute — there is correct compute for the role, the volume, and your team's operational maturity today. Lambda is the best agent orchestrator on AWS; it's a terrible model server. Bedrock is the rational choice for most products until significant scale, because the real cost includes ops, not just tokens. EKS+GPU is powerful and expensive to operate — use it when volume justifies it and the team is ready, not as a first step. Fargate fills the middle space for CPU-bound AI services that need more than Lambda offers without GPU complexity. The right decision is the one you can operate, scale, and debug at 3am — not the one that looks most impressive in the diagram.

References

AWS Lambda — Product Page AWS Fargate — Product Page Amazon EKS — Product Page Amazon Bedrock — Pricing AWS — GPU / Accelerated Computing Instance Types

#aws#lambda#fargate#eks#gpu#inference#bedrock#ai-compute

Case sources

AWS Lambda AWS Fargate Amazon EKS Amazon Bedrock — Pricing AWS — GPU instances (Accelerated computing)

Liked this study? Get the next one.

Post-mortems, ADRs and architecture deep dives in your inbox — the way an architect reads them.

No spam · unsubscribe anytime

Written with AI assistance from the public case and my architect's reading.

Ask Fernando about this

Get a focused answer about this study from my AI assistant, grounded in my work.

Join the conversation

Verify your email to join in — you'll also get the newsletter. No password.

PlaybookAWS / Compute

Playbook: Where to Run AI on AWS — Lambda vs Fargate vs ECS/EKS (GPU)

Aug 20, 2025 9 min AI-assisted

Listen to study

generated on play

Generated only on first play

On demand

0:000:00

Speed

The MP3 is saved to S3 after the first play.

What you'll be able to decide after reading this

Lambda is for orchestration, glue, and calling Bedrock — not for hosting heavy models.

Fargate serves CPU-bound inference APIs with predictable traffic; it doesn't solve cheap dedicated GPU at scale.

ECS/EKS+GPU is the path for serving open-weights at volume — but you pay in ops, not in tokens.

Bedrock charges per token with zero ops; there's a volume where self-hosted becomes cheaper — do the math with YOUR traffic before deciding.

The hidden axis is total cost per token (infra + ops + engineering latency), not just instance price.

LLM agents live well in Lambda; the model the agent calls does not.

Scale References and Limits — AWS (2024/2025)

Lambda max timeout: 15 minutes
Lambda max RAM: 10 GB (no GPU)
AWS GPU instance families: p3, p4d, p5, g4dn, g5, g6, inf2, trn1
Bedrock — base model (Claude 3 Sonnet estimate): $3/M tokens input, $15/M tokens output (verify current pricing)
Fargate — max vCPU per task: 16 vCPU / 120 GB RAM (no native GPU)
EKS + g5.12xlarge (4x A10G 24GB): ~$16/h on-demand; ~$5-6/h spot (estimate)
Lambda cold start (container image): 1–5 s typical; can reach 10 s+ for large images

The mental model that unlocks everything: separate the agent from the model

The hidden axis: total cost per token, not instance price

When someone shows me a spreadsheet comparing Bedrock with self-hosted and concludes self-hosted is 10x cheaper, my first question is: did you include engineering cost?

The math you need to do:

Total Bedrock cost = tokens × price/token

Total self-hosted cost = 
  (GPU hours × price/hour)
  + (engineering hours setup × cost/hour)
  + (engineering hours maintenance/month × cost/hour × months)
  + (cost of incident when cluster goes down at 3am)

Decision Matrix: Lambda vs Fargate vs EKS+GPU vs Bedrock

AWS Lambda

Pros

Zero infra to manage; scales to zero automatically
Ideal for agent orchestration, glue code, calling Bedrock/external APIs
Zero cost when idle; great for unpredictable spikes
Native integration with Step Functions, SQS, EventBridge for async flows

Cons

No GPU; max RAM 10 GB; max timeout 15 minutes
Cold start can be a problem for P99 latency with large images
Cannot serve heavy models; cannot do local LLM inference
Per-invocation cost can surprise at very high and constant traffic

Use for orchestration, agent, glue, webhooks, calling Bedrock. Never for hosting models.

AWS Fargate (ECS/EKS)

Pros

Container serverless: no EC2 node management
No Lambda cold start; container stays warm while task is running
Good for CPU-bound inference API (small models, embeddings, reranking)
Task isolation per request; good security surface

Cons

No native GPU support (Fargate has no GPU instances)
More expensive than Lambda for very spiky traffic (pays per active task)
Slower scale-to-zero than Lambda; scale-out latency in minutes
For GPU you need ECS/EKS with EC2, not pure Fargate

Use for RAG workers, embedding APIs, CPU reranking services, AI microservices without GPU.

ECS / EKS + GPU (EC2)

Pros

Access to full AWS GPU family: g4dn, g5, p4d, p5, inf2, trn1
Lowest cost per token at high volume with Reserved Instances or Spot
Full control: model, version, quantization, batching strategy
Support for frameworks: vLLM, TGI, TensorRT-LLM, Triton

Cons

High operational overhead: cluster, CUDA drivers, GPU autoscaling, VRAM monitoring
High fixed cost even at low traffic (GPU instance running)
Not suitable for prototype or unpredictable traffic
Requires MLOps maturity: model CI/CD, rollback, A/B serving

Use when volume justifies operations: >500M tokens/month, stable traffic, platform team available.

Amazon Bedrock

Pros

Zero ops: no cluster, no driver, no autoscaling to manage
Access to frontier models (Claude, Titan, Llama, Mistral, Stable Diffusion)
Automatic scaling; no model cold start; AWS SLA
Low total cost for small and medium volumes when ops is included

Cons

Higher cost per token than self-hosted at very large volume
No model version control (AWS can deprecate); no arbitrary fine-tuning
Data leaves your VPC to Bedrock endpoint (consider PrivateLink)
Rate limits per account/region; can be bottleneck at high burst

Default for most products until volume justifies self-hosted. Combine with Lambda for orchestration.

Comparison by Practical AI Workload Axes

	Axis	Lambda	Fargate	EKS + GPU	Bedrock
Ideal traffic pattern	Spiky / unpredictable	Constant / predictable	High volume / stable	Any (AWS-managed)	—
Max execution duration	15 min (hard limit)	Unlimited (active task)	Unlimited (active pod)	Depends on model (streaming available)	—
GPU support	❌ No	❌ No (pure Fargate)	✅ Yes (g4dn, g5, p4d, p5, inf2, trn1)	✅ Yes (managed)	—
Cost at low traffic	⭐ Best (scales to zero)	Medium (minimum task running)	High (idle GPU = burning $)	⭐ Good (pay only what you use)	—
Cost at high volume	Medium-high (per invocation)	Medium (per vCPU/GB-hour)	⭐ Best (RI/Spot + batching)	High (token price doesn't drop with volume)	—
Operational overhead	⭐ Minimal	Low-medium	High (cluster, CUDA, autoscaling, VRAM)	⭐ Zero (AWS manages)	—
Cold start / warmup latency	1–10 s (container image)	30 s – 2 min (new task)	Minutes (new GPU node)	~100–500 ms (model already loaded)	—
Primary AI use case	Agent, orchestration, calling Bedrock	Embedding API, reranking, RAG worker	Serving open-weights LLM, batch inference	Frontier model inference, prototype, production without ops	—

When each layer makes sense: the real patterns

Pattern 1 — Agent with Bedrock (the most common and correct for 80% of cases)

Pattern 2 — RAG Service with Fargate

Pattern 3 — Self-hosted LLM on EKS+GPU

Pattern 4 — Async Batch Inference

How to Make the Decision: 7 Questions in Order

1
1. Are you serving a model or orchestrating calls?
If it's orchestration (agent, glue, router, webhook): Lambda or Step Functions. Stop here. If it's serving a model: continue.
2
2. Does the model need GPU?
Small embedding models (< 500M params), rerankers, and light classifiers run fine on CPU. Any generative LLM with >1B params needs GPU for acceptable production latency. If CPU: Fargate or Lambda (if it fits in 10 GB). If GPU: EKS+GPU or Bedrock.
3
3. Is the model proprietary/frontier or open-weights?
Claude, GPT-4, Titan, managed Stable Diffusion: use Bedrock (or provider's direct API). Llama, Mistral, Qwen, Falcon, your fine-tuned models: self-hosted on EKS+GPU.
4
4. What is your monthly token volume?
< 100M tokens/month: Bedrock almost always wins on total cost. 100M–1B: do the math (include ops). > 1B with stable traffic: self-hosted starts making financial sense. These are guideline thresholds — use your real numbers.
5
5. What is the traffic pattern?
Spiky/unpredictable (e.g., B2C product with viral spikes): Bedrock or Lambda+Bedrock. Constant and predictable (e.g., document processing pipeline): Fargate or EKS+GPU with Reserved Instances. Batch/offline: AWS Batch + GPU or Bedrock Batch API.
6
6. Does your team have operational capacity for GPU?
EKS+GPU requires: someone who understands Kubernetes, CUDA, custom metric autoscaling, VRAM monitoring, and is available for incidents at 3am. If you don't have that profile on the team today, don't start with EKS+GPU. Bedrock or SageMaker Endpoints are the way.
7
7. Does inference take more than 15 minutes?
Video generation, long document processing, embedding batch: don't use Lambda directly. Use SQS + Lambda to enqueue + ECS Task or AWS Batch to process. Or Step Functions with callback pattern for long flows with Lambda.

AI Compute Decision Tree on AWS

Decision flow from AI request to the correct compute. Read top to bottom following conditions.

🚦 Entry

AI Workload · Request

🔀 Decision Layer

Servindo modelo · ou orquestrando?
Precisa de GPU?
Open-weights · ou frontier?
Volume > 500M · tokens/mês?
Tráfego estável · e time de ops?
Duração · > 15 min?

✅ Compute Targets

Lambda · + Step Functions
Fargate ECS · CPU inference
Amazon Bedrock · Managed inference
EKS + GPU · (g5/p4d/inf2)
AWS Batch · + GPU / SQS

📦 Model Layer

Bedrock API · Claude/Titan/Llama
vLLM / TGI · open-weights
Embedding Model · CPU (MiniLM/BGE)

Anti-pattern: Dumping everything into Lambda 'because it's serverless'

Rule of Thumb

My Senior Take

Senior Solutions Architect

Verdict

References

AWS Lambda — Product Page AWS Fargate — Product Page Amazon EKS — Product Page Amazon Bedrock — Pricing AWS — GPU / Accelerated Computing Instance Types

#aws#lambda#fargate#eks#gpu#inference#bedrock#ai-compute

Case sources

AWS Lambda AWS Fargate Amazon EKS Amazon Bedrock — Pricing AWS — GPU instances (Accelerated computing)

Liked this study? Get the next one.

Post-mortems, ADRs and architecture deep dives in your inbox — the way an architect reads them.

No spam · unsubscribe anytime

Written with AI assistance from the public case and my architect's reading.

Ask Fernando about this

Get a focused answer about this study from my AI assistant, grounded in my work.

Join the conversation

Verify your email to join in — you'll also get the newsletter. No password.

Listen to study

What you'll be able to decide after reading this

Scale References and Limits — AWS (2024/2025)

The mental model that unlocks everything: separate the agent from the model

The hidden axis: total cost per token, not instance price

Decision Matrix: Lambda vs Fargate vs EKS+GPU vs Bedrock

AWS Lambda

AWS Fargate (ECS/EKS)

ECS / EKS + GPU (EC2)

Amazon Bedrock

Comparison by Practical AI Workload Axes

When each layer makes sense: the real patterns

How to Make the Decision: 7 Questions in Order

1. Are you serving a model or orchestrating calls?

2. Does the model need GPU?

3. Is the model proprietary/frontier or open-weights?

4. What is your monthly token volume?

5. What is the traffic pattern?

6. Does your team have operational capacity for GPU?

7. Does inference take more than 15 minutes?

AI Compute Decision Tree on AWS

Anti-pattern: Dumping everything into Lambda 'because it's serverless'

Rule of Thumb

Verdict

References

Ask Fernando about this

Join the conversation

Listen to study

What you'll be able to decide after reading this

Scale References and Limits — AWS (2024/2025)

The mental model that unlocks everything: separate the agent from the model

The hidden axis: total cost per token, not instance price

Decision Matrix: Lambda vs Fargate vs EKS+GPU vs Bedrock

AWS Lambda

AWS Fargate (ECS/EKS)

ECS / EKS + GPU (EC2)

Amazon Bedrock

Comparison by Practical AI Workload Axes

When each layer makes sense: the real patterns

How to Make the Decision: 7 Questions in Order

1. Are you serving a model or orchestrating calls?

2. Does the model need GPU?

3. Is the model proprietary/frontier or open-weights?

4. What is your monthly token volume?

5. What is the traffic pattern?

6. Does your team have operational capacity for GPU?

7. Does inference take more than 15 minutes?

AI Compute Decision Tree on AWS

Anti-pattern: Dumping everything into Lambda 'because it's serverless'

Rule of Thumb

Verdict

References

Ask Fernando about this

Join the conversation