Design Doc: EKS Multi-Tenant at Scale — Karpenter, Isolation, and Cost

Mar 6, 2026 10 min AI-assisted

This document describes the architecture of a multi-tenant EKS platform for dozens of teams, covering workload isolation via namespaces, network policies, and RBAC; autoscaling and consolidation with Karpenter using Spot instances; and per-tenant cost allocation via Kubecost. The goal is to operate with security, cost predictability, and infrastructure efficiency without sacrificing team autonomy.

Running dozens of tenants on a single EKS cluster is an efficiency bet that only pays off if isolation, cost, and scale are treated as first-class citizens from day zero — not as late-stage adjustments.

The Problem

Internal Kubernetes platforms grow organically. One team creates a cluster, another asks for access, and eighteen months later you have forty teams sharing infrastructure designed for five. The typical result is a cluster where nobody knows exactly what costs what, critical workloads compete with batch jobs for nodes, and the blast radius of a misconfigured namespace can affect neighbors.

The scenario this document addresses is an EKS platform managed by a central platform team, serving between 30 and 60 product teams. Each team is a tenant: they have their own namespaces, repositories, pipelines, and ideally their own cost account. The cluster needs to scale reactively and economically, maintain isolation strong enough to satisfy security audits, and produce cost data granular enough for showback or chargeback.

The three most common failure vectors in this model are: (1) insufficient isolation — one tenant consumes another's resources due to missing quotas or network policies, or worse, achieves privilege escalation via misconfigured RBAC; (2) invisible cost — without consistent labels and allocation tooling, the AWS bill becomes an aggregate number nobody can contest or optimize; (3) slow or expensive autoscaling — Cluster Autoscaler with fixed node groups can't keep up with heterogeneous workload spikes and tends toward overprovisioning for safety. Karpenter solves the third vector, but needs guardrails to avoid becoming an uncontrolled cost vector in the hands of unrestricted tenants.

Goals and Non-Goals

✅ GOAL: Tenant isolation with dedicated namespace, ResourceQuota, LimitRange, default NetworkPolicy, and least-privilege RBAC
✅ GOAL: Reactive autoscaling and node consolidation via Karpenter with Spot support and On-Demand fallback
✅ GOAL: Per-tenant cost allocation with namespace granularity via Kubecost (showback at MVP, chargeback in the future)
✅ GOAL: Workload identity via IRSA / EKS Pod Identity with no static credentials
✅ GOAL: Node pool separation by workload class (system, production, batch/spot) via Karpenter NodePool and NodeClass
❌ NON-GOAL: Cluster-level isolation per tenant (cluster-per-tenant is out of scope for this iteration)

Fact Sheet

Platform: EKS (composite scenario, grounded in real practices)
Tenants: 30–60 product teams
Estimated scale: 500–1500 pods in production, batch peaks up to 3x
Primary region: us-east-1 (multi-AZ: 3 AZs)
EKS version: 1.29+ (managed, auto-upgrade via pipeline)
Autoscaler: Karpenter v1.x (replacing Cluster Autoscaler)
Cost / observability: Kubecost + AWS Cost Allocation Tags + Container Insights
Workload identity: EKS Pod Identity (preferred) + IRSA (legacy)
Networking: VPC CNI + Network Policies (Calico or native VPC CNI)
Estimated Spot savings: 50–70% vs On-Demand for batch workloads (market estimate)

Proposed Design: Layered Isolation

The isolation model adopted is namespace-as-tenant-boundary, reinforced across four independent layers. No single layer is sufficient; defense in depth is the governing principle.

Layer 1 — Namespace and identity: Each tenant receives a dedicated namespace (or a set of them, by environment: team-x-prod, team-x-staging). Provisioning is automated via GitOps (ArgoCD or Flux): a PR to the platform repository creates the namespace, applies default ResourceQuota and LimitRange, and creates the RoleBinding granting the team's ServiceAccount edit-level permissions within the namespace, never cluster-admin. The RBAC policy follows least privilege: teams cannot create ClusterRoles, access secrets from other namespaces, or modify NetworkPolicies; those operations are reserved for the platform team. Because the built-in edit ClusterRole includes write access to NetworkPolicies, enforcing that last restriction requires a custom role or an admission policy rather than binding edit as-is.
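As a minimal sketch of what the onboarding PR could render, the following builds the Namespace and RoleBinding objects as plain dictionaries. The function names, the `tenant-edit` role name (a hypothetical custom ClusterRole; the built-in `edit` could be substituted), and the ServiceAccount name are assumptions for illustration, not part of the source design.

```python
def tenant_namespace(team: str, env: str) -> dict:
    """Namespace named <team>-<env>, labelled for later cost allocation."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": f"{team}-{env}", "labels": {"team": team, "env": env}},
    }


def tenant_role_binding(namespace: str, service_account: str,
                        cluster_role: str = "tenant-edit") -> dict:
    """Namespaced RoleBinding: the role applies only inside `namespace`."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",  # RoleBinding, not ClusterRoleBinding
        "metadata": {"name": "tenant-edit", "namespace": namespace},
        "subjects": [{
            "kind": "ServiceAccount",
            "name": service_account,
            "namespace": namespace,
        }],
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": cluster_role,
        },
    }


ns = tenant_namespace("team-x", "prod")
binding = tenant_role_binding(ns["metadata"]["name"], "team-x-deployer")
```

Referencing a ClusterRole from a RoleBinding keeps one role definition for all tenants while scoping each grant to a single namespace.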

Layer 2 — Network: A default-deny-all NetworkPolicy is applied to each namespace at creation time. Teams must explicitly declare the ingress and egress rules their workloads require, and cross-namespace traffic is blocked by default; exceptions are granted via policy reviewed by the platform team. This greatly reduces the risk of one tenant inadvertently accessing another tenant's internal APIs. Because default-deny also blocks egress, the namespace baseline includes an explicit allow rule for cluster DNS (kube-dns, port 53) so that name resolution keeps working in every namespace, while pod egress to the Kubernetes API is limited by NetworkPolicy to the control plane endpoint addresses.
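The namespace baseline can be sketched as two NetworkPolicy objects, built here as dictionaries and printed as JSON (which kubectl accepts). This is an illustration under assumptions: the function names are hypothetical, and the `k8s-app: kube-dns` selector is the label CoreDNS pods commonly carry on EKS, which should be verified in the target cluster.

```python
import json


def default_deny(namespace: str) -> dict:
    """Selects every pod and lists no rules, so all traffic is denied."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},  # empty selector matches all pods in the namespace
            "policyTypes": ["Ingress", "Egress"],
        },
    }


def allow_dns(namespace: str) -> dict:
    """Egress exception so pods can still reach cluster DNS on port 53."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-dns-egress", "namespace": namespace},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Egress"],
            "egress": [{
                "to": [{
                    "namespaceSelector": {
                        "matchLabels": {"kubernetes.io/metadata.name": "kube-system"}
                    },
                    "podSelector": {"matchLabels": {"k8s-app": "kube-dns"}},
                }],
                "ports": [
                    {"protocol": "UDP", "port": 53},
                    {"protocol": "TCP", "port": 53},
                ],
            }],
        },
    }


def baseline_policies(namespace: str) -> list:
    return [default_deny(namespace), allow_dns(namespace)]


print(json.dumps(baseline_policies("team-a-prod"), indent=2))
```

NetworkPolicies are additive, so the DNS rule opens only port 53 toward kube-dns and everything else stays denied until a tenant-specific allow rule is added.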

Layer 3 — Compute and quotas: ResourceQuota defines limits on CPU, memory, pod count, and object counts (Services, ConfigMaps, Secrets) per namespace. LimitRange defines default requests and limits for containers that do not specify them, which is critical for the Kubernetes scheduler and Karpenter to make correct bin-packing decisions. Pods whose containers set neither requests nor limits are classed as BestEffort and are the first to be evicted under node memory pressure. The platform team defines three quota tiers (small, medium, large) and tenants request the appropriate tier via PR.
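A sketch of how the three tiers might map to ResourceQuota and LimitRange objects follows. The tier sizes and default values are hypothetical (the design names the tiers but not their numbers), as are the function and object names.

```python
# Hypothetical tier sizes; replace with values agreed by the platform team.
TIERS = {
    "small":  {"cpu": "8",  "memory": "16Gi",  "pods": "50"},
    "medium": {"cpu": "32", "memory": "64Gi",  "pods": "200"},
    "large":  {"cpu": "96", "memory": "192Gi", "pods": "500"},
}


def resource_quota(namespace: str, tier: str) -> dict:
    """Namespace-wide ceiling on requested CPU, memory, and pod count."""
    sizes = TIERS[tier]  # an unknown tier raises KeyError on purpose
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"tenant-quota-{tier}", "namespace": namespace},
        "spec": {"hard": {
            "requests.cpu": sizes["cpu"],
            "requests.memory": sizes["memory"],
            "pods": sizes["pods"],
        }},
    }


def limit_range(namespace: str) -> dict:
    """Defaults injected into containers that omit requests or limits."""
    return {
        "apiVersion": "v1",
        "kind": "LimitRange",
        "metadata": {"name": "container-defaults", "namespace": namespace},
        "spec": {"limits": [{
            "type": "Container",
            "defaultRequest": {"cpu": "100m", "memory": "128Mi"},  # assumed
            "default": {"cpu": "500m", "memory": "512Mi"},         # assumed
        }]},
    }


quota = resource_quota("team-x-prod", "medium")
defaults = limit_range("team-x-prod")
```

With both objects in place, a container submitted without requests receives the LimitRange defaults, and those defaults then count against the namespace quota.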

Layer 4 — Workload identity: No pod should carry static AWS credentials. The preferred model is EKS Pod Identity, which associates a Kubernetes ServiceAccount with an IAM Role via the eks-pod-identity-agent DaemonSet. The role is created with minimal scope and reviewed by the security team. IRSA remains supported for legacy workloads, but new onboardings use Pod Identity. This eliminates the vulnerability class where credentials are exposed in environment variables or ConfigMaps.
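To make the association concrete, here is a sketch of the two pieces involved: the IAM trust policy that lets EKS Pod Identity assume a role, and the arguments for an association between a ServiceAccount and that role. The cluster name, account ID, role name, and ServiceAccount are hypothetical placeholders; the argument names follow the EKS CreatePodIdentityAssociation API as I understand it and should be checked against the SDK in use.

```python
import json


def pod_identity_trust_policy() -> dict:
    """Trust policy allowing the EKS Pod Identity service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "pods.eks.amazonaws.com"},
            "Action": ["sts:AssumeRole", "sts:TagSession"],
        }],
    }


def association_args(cluster: str, namespace: str,
                     service_account: str, role_arn: str) -> dict:
    """Arguments for one ServiceAccount-to-role association."""
    return {
        "clusterName": cluster,
        "namespace": namespace,
        "serviceAccount": service_account,
        "roleArn": role_arn,
    }


trust = pod_identity_trust_policy()
args = association_args(
    "platform-prod",                                 # hypothetical cluster
    "team-a-prod",
    "team-a-app",                                    # hypothetical ServiceAccount
    "arn:aws:iam::111122223333:role/team-a-app",     # placeholder account
)
print(json.dumps(trust, indent=2))
```

The trust policy contains no cluster-specific OIDC issuer, so the same role definition can be reused when a workload moves to another cluster.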

Architecture: Multi-Tenant EKS with Karpenter

EKS cluster view showing node pool separation by workload class, per-tenant namespace isolation, autoscaling flow via Karpenter, and integration with AWS identity and cost services.

👤 Tenants / GitOps
  • Dev Teams · PR → Git
  • ArgoCD / Flux · GitOps Controller
🔐 AWS Control Plane
  • EKS Control Plane · API Server / etcd
  • IAM · Pod Identity / IRSA
  • ECR · Container Registry
⚙️ Platform Namespace (kube-system / platform)
  • Karpenter · NodePool Controller
  • Kubecost · Cost Allocation
  • Pod Identity Agent · DaemonSet
  • OPA / Kyverno · Admission Webhook
🟦 Tenant A — team-a-prod
  • Namespace: team-a-prod · ResourceQuota + LimitRange
  • Pods (team-a) · ServiceAccount → IAM Role A
  • NetworkPolicy · default-deny + allow rules
🟩 Tenant B — team-b-prod
  • Namespace: team-b-prod · ResourceQuota + LimitRange
  • Pods (team-b) · ServiceAccount → IAM Role B
  • NetworkPolicy · default-deny + allow rules
🖥️ EC2 Node Pools
  • NodePool: system · On-Demand, m6i, tainted
  • NodePool: production · On-Demand + Spot fallback · m6i / m7i / c6i
  • NodePool: batch · Spot-first, diverse · r6i / m6i / c6i
📊 Observability / Cost
  • CloudWatch · Container Insights
  • S3 + CUR · Cost & Usage Report
  • AWS Cost Allocation Tags · team / env / service

Karpenter: NodePools, Spot, and Consolidation

Karpenter replaces Cluster Autoscaler with a fundamentally different approach: instead of scaling pre-defined node groups, it observes Unschedulable pods and provisions the most suitable node directly via the EC2 Fleet API, respecting constraints declared in the pod (nodeSelector, affinity, tolerations, topology spread). This enables more efficient bin-packing and significantly lower provisioning latency.

NodePool structure: I define three NodePools with distinct responsibilities:

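  1. system: On-Demand nodes dedicated to platform workloads (CoreDNS, Kubecost, admission webhooks). These nodes carry a CriticalAddonsOnly taint and are excluded from aggressive consolidation. I use fixed m6i.large or m6i.xlarge instances for predictability. The Karpenter controller itself should run on capacity it does not manage (a small managed node group or Fargate), as the Karpenter documentation recommends, so that it can never consolidate away its own node.
  2. production: nodes for product workloads with SLAs. Strategy: On-Demand as the base, with Spot capacity for up to 30% of nodes when available. The NodePool requirements allow a diverse set of instance families (m6i, m7i, c6i, c7i) to maximize Spot availability and set karpenter.sh/capacity-type to [on-demand, spot]. Karpenter has no built-in Spot-to-On-Demand ratio setting, and NodePool weight only orders pools by priority; when both capacity types are allowed Karpenter favors Spot, so the 30% cap needs an explicit mechanism, for example separate Spot and On-Demand NodePools with their own limits, or topologySpreadConstraints over a custom capacity-spread node label. PodDisruptionBudgets are mandatory for workloads in this pool: without a PDB, Karpenter can consolidate nodes and bring down all pods of a Deployment at once.
  3. batch: Spot-first nodes for ML, ETL, and CI jobs. Here I accept interruption with declared tolerance: Karpenter's interruption handling, enabled by pointing the controller at an SQS queue (the settings.interruptionQueue Helm value), receives EC2 interruption notifications and drains nodes gracefully before termination. Instance type diversity is maximized in this pool (r6i, m6i, c6i, r7i, m7i in multiple sizes) to keep Spot Placement Scores high.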

Consolidation: I enable consolidationPolicy: WhenEmptyOrUnderutilized (the Karpenter v1 name for what earlier releases called WhenUnderutilized) for the batch pool and WhenEmpty for production. The difference matters: WhenEmptyOrUnderutilized allows Karpenter to move pods between nodes to free underutilized nodes (active bin-packing), while WhenEmpty only removes completely empty nodes. For production, aggressive consolidation can cause excessive disruption; I prefer WhenEmpty with well-configured PDBs.
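The three pools and their consolidation policies can be sketched as Karpenter v1 NodePool objects. Note that Karpenter v1 spells the aggressive policy WhenEmptyOrUnderutilized. The CPU limits, the consolidateAfter value, the EC2NodeClass name `default`, and the helper names are assumptions for illustration; field names follow the karpenter.sh/v1 API as I understand it.

```python
def node_pool(name, capacity_types, families, policy, cpu_limit, taints=None):
    """Build a karpenter.sh/v1 NodePool as a plain dictionary."""
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "template": {"spec": {
                "nodeClassRef": {
                    "group": "karpenter.k8s.aws",
                    "kind": "EC2NodeClass",
                    "name": "default",  # assumed EC2NodeClass name
                },
                "requirements": [
                    {"key": "karpenter.sh/capacity-type",
                     "operator": "In", "values": capacity_types},
                    {"key": "karpenter.k8s.aws/instance-family",
                     "operator": "In", "values": families},
                ],
                "taints": taints or [],
            }},
            "disruption": {
                "consolidationPolicy": policy,
                "consolidateAfter": "5m",  # assumed settle time
            },
            "limits": {"cpu": str(cpu_limit)},  # per-pool cost ceiling
        },
    }


def build_pools():
    critical = [{"key": "CriticalAddonsOnly", "effect": "NoSchedule"}]
    return [
        node_pool("system", ["on-demand"], ["m6i"], "WhenEmpty", 64, critical),
        node_pool("production", ["on-demand", "spot"],
                  ["m6i", "m7i", "c6i", "c7i"], "WhenEmpty", 2000),
        node_pool("batch", ["spot"],
                  ["r6i", "m6i", "c6i", "r7i", "m7i"],
                  "WhenEmptyOrUnderutilized", 4000),
    ]


pools = {p["metadata"]["name"]: p for p in build_pools()}
```

Keeping the policy as a per-pool parameter makes the later move of production to the more aggressive mode a one-line, reviewable change.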

Cost guardrails: Without constraints, a tenant can create a Job requesting 1000 pods of 4 vCPU each (4,000 vCPU in total), and Karpenter will try to provision the equivalent of roughly 250 16-vCPU nodes (c6i.4xlarge, for example) within minutes. To prevent this, ResourceQuota in the namespace limits total requestable CPU/memory. Additionally, each NodePool defines limits.cpu and limits.memory: Karpenter won't provision beyond these limits even if there are pending pods. This creates a cost ceiling per pool.
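The arithmetic behind the guardrail can be checked with a short sketch. It assumes 16-vCPU nodes, a hypothetical allocatable figure of 15.9 vCPU after system reservations, and hypothetical quota and pool limits; the function names are illustrative.

```python
import math


def nodes_needed(pods: int, cpu_per_pod: float, node_allocatable_cpu: float) -> int:
    """Nodes required when each node fits only whole pods (CPU-bound case)."""
    pods_per_node = int(node_allocatable_cpu // cpu_per_pod)
    if pods_per_node == 0:
        raise ValueError("pod does not fit on this node size")
    return math.ceil(pods / pods_per_node)


def schedulable_cpu(requested: float, quota_cpu: float, pool_limit_cpu: float) -> float:
    """CPU that can actually be provisioned: the tightest ceiling wins."""
    return min(requested, quota_cpu, pool_limit_cpu)


requested = 1000 * 4                        # 4,000 vCPU asked for by the Job
nominal = nodes_needed(1000, 4, 16)         # ignoring reserved capacity
realistic = nodes_needed(1000, 4, 15.9)     # assumed allocatable CPU
capped = schedulable_cpu(requested, quota_cpu=96, pool_limit_cpu=2000)

print(nominal, realistic, capped)
```

With a 96-vCPU namespace quota, the runaway Job is stopped at 24 pods by the API server before Karpenter ever sees the remaining ones; the NodePool limit is the second line of defense if a quota is missing or too generous.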

Evaluated Alternatives

Cluster per Tenant (Hard Multi-Tenancy)

Pros
  • Maximum isolation: blast radius contained per cluster
  • Independent upgrade and configuration per tenant
  • Suitable for strict compliance requirements (PCI, HIPAA)
Cons
  • Control plane cost: ~$0.10/h per EKS cluster = significant overhead for many tenants
  • Operational overhead: upgrades, patches, and monitoring multiplied by N clusters
  • Capacity fragmentation: each cluster needs minimum nodes for system workloads

Reserved for tenants with compliance requirements the shared cluster cannot satisfy
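To put the control plane overhead in perspective, a rough calculation using the ~$0.10/h figure above and an assumed 730 hours per month (the function name is illustrative, and node, add-on, and operations costs are deliberately excluded):

```python
HOURS_PER_MONTH = 730  # assumed average month length in hours


def control_plane_monthly_cost(clusters: int, hourly_rate: float = 0.10) -> float:
    """Monthly EKS control plane fee for `clusters` clusters, in USD."""
    return clusters * hourly_rate * HOURS_PER_MONTH


shared = control_plane_monthly_cost(1)
per_tenant = control_plane_monthly_cost(60)
print(f"shared: ${shared:,.0f}/month, cluster-per-tenant: ${per_tenant:,.0f}/month")
```

The fee alone is modest next to the duplicated system nodes and upgrade work listed in the cons, which is why the operational overhead tends to dominate the decision.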

Cluster Autoscaler (keep legacy)

Pros
  • Mature and widely documented
  • No migration required
Cons
  • Scale-up latency: 2–5 min vs ~30s for Karpenter
  • Fixed node groups: hard to diversify instance types for Spot
  • Limited consolidation: no active bin-packing
  • Configuration overhead: dozens of ASGs to cover workload diversity

Rejected: the opportunity cost of not migrating to Karpenter is high at scale

Namespace-as-Tenant (proposed)

Pros
  • Cost efficiency: maximum bin-packing, single control plane
  • Centralized operation: unified upgrades, policies, and observability
  • Fast onboarding: GitOps automates namespace creation in minutes
Cons
  • Shared kernel isolation: container escape vulnerabilities affect all tenants
  • Noisy neighbor on CPU/memory if quotas are not well calibrated
  • Not suitable for PCI/HIPAA compliance without significant additional controls

Adopted: best trade-off for the described tenant profile (product teams without strict compliance requirements)

Service Mesh (Istio) for network isolation

Pros
  • Automatic mTLS between services, L7 traffic observability
  • More expressive authorization policies than NetworkPolicy
Cons
  • Sidecar overhead: ~50–100MB RAM and ~0.5 vCPU per pod (estimate)
  • High operational complexity: steep learning curve, traffic debugging
  • Additional latency in the data path

Deferred: NetworkPolicy covers the current threat model; revisit when mTLS between services is an explicit requirement

Cost Allocation Models

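Models compared by granularity, accuracy, operational complexity, and suitability for chargeback:

AWS Cost Tags (pure)
  • Granularity: Account / AWS Service
  • Accuracy: Low (pod-blind)
  • Operational complexity: Low
  • Suitable for chargeback? No (too aggregated)

Kubecost (proposed)
  • Granularity: Namespace / Label / Pod
  • Accuracy: High (models CPU+RAM+net+storage)
  • Operational complexity: Medium (requires consistent labels)
  • Suitable for chargeback? Yes (with correct labels)

OpenCost (OSS alternative)
  • Granularity: Namespace / Label
  • Accuracy: Medium
  • Operational complexity: Low (no enterprise UI)
  • Suitable for chargeback? Partially (requires custom integration)

Split Cost Allocation (AWS native)
  • Granularity: EKS namespace (via CUR)
  • Accuracy: Medium (proportional to requests)
  • Operational complexity: Low (console configuration)
  • Suitable for chargeback? Yes (integrated with CUR/Cost Explorer)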
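The "proportional to requests" idea behind these tools can be illustrated with a deliberately simplified sketch that splits one node's hourly cost by CPU requests only. Real tools also weigh memory, network, and storage; the node price and the function name are hypothetical.

```python
def split_node_cost(node_hourly_cost: float,
                    cpu_requests_by_ns: dict,
                    node_cpu: float) -> dict:
    """Allocate a node's cost by each namespace's share of CPU requests.

    Whatever is not requested by anyone is reported as idle cost.
    """
    shares = {
        ns: node_hourly_cost * requested / node_cpu
        for ns, requested in cpu_requests_by_ns.items()
    }
    shares["__idle__"] = node_hourly_cost - sum(shares.values())
    return shares


# Hypothetical 8-vCPU node at $0.40/h with two tenants on it.
shares = split_node_cost(0.40, {"team-a-prod": 4, "team-b-prod": 2}, node_cpu=8)
for namespace, cost in shares.items():
    print(f"{namespace}: ${cost:.3f}/h")
```

The idle line is the part consolidation aims to shrink, and deciding whether it is charged back to tenants or absorbed by the platform team is a policy choice worth making explicit before chargeback starts.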

Decision: EKS Pod Identity over IRSA for new workloads

Accepted
Context

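IRSA (IAM Roles for Service Accounts) requires ServiceAccount annotation and OIDC provider configuration. EKS Pod Identity, launched in 2023, simplifies the model: the agent on each node serves temporary credentials to the AWS SDK through the container credentials provider, with no ServiceAccount annotation and no per-cluster OIDC provider. Each ServiceAccount is associated with a single role, but the same role can be reused across clusters without editing its trust policy, and the default limit of 100 OIDC providers per account no longer applies.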

Decision

New workloads use EKS Pod Identity. Legacy workloads with IRSA are migrated opportunistically. The eks-pod-identity-agent is installed as a managed add-on, which runs it as a DaemonSet on every node.

Consequences
  • Simplified onboarding: teams don't need to configure OIDC trust policies manually
  • Reduced risk: no dependency on external OIDC endpoint for workload authentication
  • IRSA + Pod Identity coexistence period during migration: clearly document which mechanism each workload uses

Rollout Plan

    Phase 0 — Foundation (Weeks 1–2)

    Install Karpenter via Helm on the existing cluster (without removing Cluster Autoscaler yet). Create system NodePool with taints. Configure EC2NodeClass with AL2023 AMI family, private subnets, and correct security groups. Validate that platform workloads (CoreDNS, Kubecost, admission webhooks) are scheduled on system nodes. Install eks-pod-identity-agent as a managed add-on.

    Phase 1 — Autoscaling Migration (Weeks 3–4)

    Create production and batch NodePools. Migrate 20% of production workloads to Karpenter-managed nodes (via nodeSelector). Monitor provisioning latency, Spot interruption rate, and consolidation behavior. After validation, scale to 100% of workloads and remove Cluster Autoscaler. Configure SQS queue for Spot interruption and validate graceful draining.

    Phase 2 — Tenant Isolation (Weeks 5–7)

    Implement GitOps namespace onboarding pipeline: PR creates namespace, applies ResourceQuota (selected tier), LimitRange, default-deny NetworkPolicy, RoleBinding. Install and configure OPA Gatekeeper or Kyverno with policies to: (a) reject pods without resource requests, (b) reject images without digest or from unapproved registries, (c) reject pods with hostNetwork: true or privileged: true outside the platform namespace. Audit existing namespaces and remediate violations.

    Phase 3 — Cost Allocation (Weeks 8–10)

    Configure Kubecost with CUR (Cost & Usage Report) integration via S3. Define mandatory label taxonomy: team, env, service, cost-center. Implement Kyverno policy that rejects Deployments without mandatory labels. Enable AWS Split Cost Allocation for EKS in Cost Explorer as cross-validation. Publish per-tenant showback dashboard. Define monthly cost review process with teams.

    Phase 4 — Hardening and Observability (Weeks 11–12)

    Migrate legacy IRSA workloads to EKS Pod Identity. Enable EKS control plane logging (API, audit, authenticator) to CloudWatch. Configure per-namespace cost alerts in Kubecost (threshold per quota tier). Implement Spot interruption response runbook. Conduct chaos exercise: simulate Spot node interruption in production and validate that PDBs prevent unavailability. Document and publish onboarding guide for teams.
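The admission rules listed in Phases 2 and 3 can be prototyped as a plain function before they are written as Gatekeeper or Kyverno policies. This is a sketch of the rule logic only, not of either tool's policy language; it checks labels on the pod rather than on the Deployment for simplicity, and the approved registry and platform namespace names are hypothetical.

```python
# Hypothetical values; substitute the organization's registry and namespaces.
APPROVED_REGISTRIES = ("111122223333.dkr.ecr.us-east-1.amazonaws.com/",)
PLATFORM_NAMESPACES = {"kube-system", "platform"}
REQUIRED_LABELS = {"team", "env", "service", "cost-center"}


def violations(pod: dict) -> list[str]:
    """Return every rule the pod manifest breaks (empty list = admitted)."""
    meta, spec = pod.get("metadata", {}), pod.get("spec", {})
    tenant_ns = meta.get("namespace", "default") not in PLATFORM_NAMESPACES
    found = []

    missing = REQUIRED_LABELS - set(meta.get("labels", {}))
    if missing:
        found.append("missing labels: " + ", ".join(sorted(missing)))
    if spec.get("hostNetwork") and tenant_ns:
        found.append("hostNetwork not allowed")

    for container in spec.get("containers", []):
        name = container.get("name", "?")
        image = container.get("image", "")
        if not container.get("resources", {}).get("requests"):
            found.append(f"{name}: no resource requests")
        if "@sha256:" not in image:
            found.append(f"{name}: image not pinned by digest")
        if not image.startswith(APPROVED_REGISTRIES):
            found.append(f"{name}: registry not approved")
        if container.get("securityContext", {}).get("privileged") and tenant_ns:
            found.append(f"{name}: privileged not allowed")
    return found


bad_pod = {
    "metadata": {"namespace": "team-a-prod", "labels": {"team": "a"}},
    "spec": {"hostNetwork": True,
             "containers": [{"name": "web", "image": "docker.io/nginx:latest"}]},
}
for problem in violations(bad_pod):
    print(problem)
```

Running such a function over the manifests already in the cluster gives the audit list for the remediation step before enforcement is switched on.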

Risks and Mitigations

R1: Karpenter consolidation kills workloads without PDB. Karpenter can consolidate nodes and evict all pods of a single-replica Deployment. Mitigation: Kyverno policy that rejects Deployments with replicas: 1 in production without a corresponding PDB. Alert teams during onboarding.

R2: Spot interruption during production peak. If the production NodePool has a high Spot ratio and the EC2 market is tight, multiple nodes may be interrupted simultaneously. Mitigation: limit Spot to 30% of the production NodePool; diversify instance types; configure topologySpreadConstraints to distribute pods across AZs and node types.

R3: Tenant escapes namespace via RBAC misconfiguration. An incorrect RoleBinding can grant access to resources outside the namespace. Mitigation: periodic RBAC audit with kubectl-who-can; admission policy blocking creation of ClusterRoleBindings by tenant ServiceAccounts.

R4: Kubecost with inaccurate data due to inconsistent labels. Workloads without correct labels appear as unallocated cost, invalidating showback. Mitigation: mandatory Kyverno policy + 30-day grace period for existing teams to migrate.

R5: Outdated EKS version increases attack surface. CVEs in old Kubernetes versions are exploitable. Mitigation: automated upgrade pipeline with maintenance window; policy of not supporting versions more than 2 minor releases behind the current EKS version.
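For R1, an audit of existing workloads could start from a check like the one below, which lists Deployments whose pod labels are not selected by any PDB in the same namespace. It is a sketch under assumptions: it handles only matchLabels selectors (not matchExpressions), works on manifests already loaded as dictionaries, and uses illustrative names.

```python
def selector_matches(match_labels: dict, pod_labels: dict) -> bool:
    """True when every selector label is present with the same value."""
    return all(pod_labels.get(k) == v for k, v in match_labels.items())


def deployments_without_pdb(deployments: list, pdbs: list) -> list:
    """Names ("namespace/name") of Deployments not covered by any PDB."""
    unprotected = []
    for dep in deployments:
        ns = dep["metadata"]["namespace"]
        pod_labels = dep["spec"]["template"]["metadata"].get("labels", {})
        covered = any(
            pdb["metadata"]["namespace"] == ns
            and selector_matches(
                pdb["spec"].get("selector", {}).get("matchLabels", {}), pod_labels)
            for pdb in pdbs
        )
        if not covered:
            unprotected.append(f"{ns}/{dep['metadata']['name']}")
    return unprotected


def deployment(ns, name, labels):
    return {"metadata": {"namespace": ns, "name": name},
            "spec": {"template": {"metadata": {"labels": labels}}}}


deps = [deployment("team-a-prod", "api", {"app": "api"}),
        deployment("team-a-prod", "worker", {"app": "worker"}),
        deployment("team-b-prod", "api", {"app": "api"})]
pdbs = [{"metadata": {"namespace": "team-a-prod", "name": "api-pdb"},
         "spec": {"minAvailable": 1, "selector": {"matchLabels": {"app": "api"}}}}]

print(deployments_without_pdb(deps, pdbs))
```

The namespace comparison matters: a PDB only protects pods in its own namespace, so the identically labelled Deployment in team-b-prod is still reported.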

Well-Architected Assessment

Security

Strong: least-privilege RBAC, default-deny NetworkPolicy, EKS Pod Identity with no static credentials, admission webhooks blocking insecure configurations, audit logging enabled.

Reliability

Good: mandatory PDBs, topologySpreadConstraints, Spot with On-Demand fallback, conservative consolidation in production. Residual risk: simultaneous Spot interruption on multiple nodes.

Performance efficiency

High: Karpenter provisions nodes in roughly 30 seconds versus 2–5 minutes for Cluster Autoscaler (estimates); active bin-packing reduces resource fragmentation; LimitRange ensures all pods have defined requests for correct scheduling.

Sustainability

Positive: node consolidation reduces idle instances; Spot utilizes excess EC2 capacity; efficient bin-packing reduces total cluster footprint.

Success Metrics and Targets

Node provisioning latency (p95): < 60 seconds (Cluster Autoscaler baseline: ~3 minutes)
Average node CPU utilization: > 60% (typical baseline without consolidation: 30–40%)
% of allocated cost (Kubecost): > 90% (unallocated cost < 10%)
Spot ratio in batch workloads: > 70% of batch nodes running on Spot
Blocked NetworkPolicy violations: 100% of namespaces with default-deny applied (audited monthly)
New tenant onboarding time: < 30 minutes (PR approved → namespace ready)
Cross-tenant data access incidents: 0 (measured via EKS audit log)
EC2 cost reduction vs baseline (estimate): 25–40% via consolidation + Spot (estimate; depends on workload mix)
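Since the first target is stated as a p95, it helps to fix how the percentile is computed. A minimal nearest-rank sketch follows; the sample latencies are made up for illustration and the function names are hypothetical.

```python
import math


def percentile(values: list, pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def meets_latency_target(samples_seconds: list, target_seconds: float = 60) -> bool:
    """Check the provisioning latency target (p95 under 60 seconds)."""
    return percentile(samples_seconds, 95) < target_seconds


# Made-up provisioning latencies in seconds, one slow outlier included.
samples = [28, 31, 33, 35, 36, 38, 40, 41, 42, 44,
           45, 46, 47, 48, 50, 52, 53, 55, 58, 140]
print(percentile(samples, 95), meets_latency_target(samples))
```

With twenty samples the single outlier falls above the 95th percentile, so the target is still met; an average over the same data would hide how close the tail is to the limit.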
My Senior Take
Senior Solutions Architect

The most common mistake I see in multi-tenant Kubernetes platforms isn't technical; it's sequencing. Teams install Karpenter first (because it's exciting and results are immediate) and leave isolation and cost allocation for later. Six months later, you have an efficient cluster that nobody can audit and where a quota bug can cause a five-figure surprise bill. My recommendation is to invert the priority order: isolation foundation first, autoscaling second, cost in parallel. ResourceQuotas and NetworkPolicies are cheap to implement and hard to retrofit: every namespace without default-deny is security technical debt that grows with the number of tenants.

On Karpenter specifically: the consolidation feature is powerful, but needs careful calibration in production. I always start with WhenEmpty and only move to WhenEmptyOrUnderutilized (WhenUnderutilized before Karpenter v1) after having validated PDBs on all critical workloads. The difference between the two modes is the difference between 'remove empty nodes' and 'actively move pods'; the latter can cause unexpected disruption if PDBs are misconfigured or absent.

On cost: Kubecost is good, but its accuracy depends entirely on label discipline. Without an admission policy forcing mandatory labels, you can easily end up with 30% of cost as 'unallocated', and no team will trust the numbers. The Kyverno policy for mandatory labels is not optional; it's what makes showback credible.

Finally, on EKS Pod Identity vs IRSA: migrate. The model is simpler, more secure (no OIDC thumbprint to manage), and the agent handles credential rotation automatically. The only reason to keep IRSA is compatibility with workloads that can't be migrated in the short term, and even those should have a defined migration date.

Verdict

The proposed architecture is viable and represents the state of the art for multi-tenant EKS platforms in 2024–2025. The combination of namespace-as-tenant-boundary with default-deny NetworkPolicy, per-tier ResourceQuota, Karpenter with segregated NodePools, and Kubecost with mandatory labels covers the three identified failure vectors: insufficient isolation, invisible cost, and inefficient autoscaling.

The design is not perfect; no shared cluster is. Kernel isolation is the fundamental limit: a container escape exploit affects all tenants. For organizations with strict compliance requirements (PCI-DSS, HIPAA, external tenants), cluster-per-tenant is the correct answer, and the additional operational cost is justified by the threat model. For the described profile (internal product teams without strict compliance requirements), the efficiency vs isolation trade-off is well calibrated.

The highest execution risk is rollout sequencing. The temptation to skip Phase 2 (isolation) to quickly reach the cost savings of Phase 3 is real. Resist it. A cross-tenant incident caused by missing NetworkPolicy or misconfigured RBAC costs more (in investigation time, team trust, and potential security impact) than any infrastructure savings you anticipated.

Written with AI assistance from the public case and my architect's reading.