# Playbook: Zero-Trust on AWS from Scratch — 6 Steps That Fit in a Sprint

Zero-trust is not a product or a six-month project — it is a set of architectural decisions you can start implementing today. This playbook presents six concrete, prioritized steps to eliminate implicit trust on AWS, reduce blast radius, and build continuous auditing, without halting delivery.

- URL: https://fernando.moretes.com/studies/playbook-zero-trust-na-aws-em-6-passos

- Markdown: https://fernando.moretes.com/studies/playbook-zero-trust-na-aws-em-6-passos/study.md?lang=en

- Type: Playbook

- Domain: AWS / Security

- Date: 2025-03-08

- Tags: zero-trust, aws-security, iam, privatelink, cloudtrail, identity, least-privilege, microsegmentation

- Reading time: 11 min

---

Most AWS breaches don't start with a cryptographic failure; they start with implicit trust: 'it's in the VPC, so it's fine'. This playbook dismantles that premise in six steps executable in a single sprint, delivering strong identity, real least-privilege, and auditing that detects anomalies before they become incidents.

## What You'll Be Able to Decide and Do

- Understand why 'being in the VPC' is not a credential and what replaces it
- Implement OIDC federation and ephemeral credentials, eliminating static keys
- Write IAM policies with real least-privilege: scoped per function, with no wildcards
- Segment the network per service with Security Groups and PrivateLink instead of a flat network
- Enable end-to-end encryption (internal TLS + KMS) without excessive operational overhead
- Build a continuous audit pipeline with CloudTrail + automated detection

## Playbook Context

- **Domain:** AWS / Platform Security
- **Type:** Actionable playbook — decision guide and step-by-step
- **Scope:** AWS workloads at any stage — startups to enterprise
- **Time premise:** Steps 1 and 2 in 1–3 days; all 6 in a 2-week sprint
- **Core AWS services:** IAM, IAM Identity Center, STS, KMS, PrivateLink, VPC Security Groups, CloudTrail, GuardDuty, Security Hub
- **Estimated baseline cost:** CloudTrail (management events trail): free for first trail per region; GuardDuty: ~$1–4/GB of analyzed logs (estimate); KMS: $1/key/month + $0.03/10k requests
- **When NOT to apply this entire playbook at once:** Sandbox/experimentation environments with no sensitive data — governance overhead doesn't pay off
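
As a rough sanity check on the KMS line of the baseline cost, the arithmetic can be sketched as below. The two prices are the estimates quoted above; the function name and the example workload (5 keys, 2 million requests per month) are illustrative assumptions, and any free tier is ignored.

```python
# Prices taken from the estimate in this playbook; verify against current AWS pricing.
KMS_KEY_MONTHLY_USD = 1.00        # per customer managed key per month
KMS_PER_10K_REQUESTS_USD = 0.03   # per 10,000 API requests


def kms_monthly_cost(keys: int, requests: int) -> float:
    """Estimate monthly KMS spend in USD (hypothetical helper, no free tier applied)."""
    key_cost = keys * KMS_KEY_MONTHLY_USD
    request_cost = requests / 10_000 * KMS_PER_10K_REQUESTS_USD
    return round(key_cost + request_cost, 2)


if __name__ == "__main__":
    # Assumed workload: 5 keys and 2,000,000 requests per month.
    print(kms_monthly_cost(keys=5, requests=2_000_000))
```

At this scale the per-key charge and the request charge are of similar size, which is why heavy envelope-encryption workloads usually cache data keys rather than calling KMS on every operation.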

## The mental model that unlocks everything: authenticate the request, not the network

The traditional perimeter security model assumes everything inside the network is trusted. On AWS, this translates to: 'it's in the same VPC, so it can call that service'. This assumption is the most exploited attack vector in cloud environments — not because the VPC is insecure, but because it was never designed to be the sole access control.

Zero-trust inverts the logic: **the network is hostile by definition**. Every request must prove identity, be authorized by explicit policy, and be logged. The VPC becomes a routing detail, not a security boundary.

In practice, this means three mindset shifts:

1. **Identity replaces network location.** A Kubernetes pod, a Lambda function, or an EC2 instance is not trusted because it's in a private subnet — it's trusted because it presented a verifiable credential (OIDC token, IAM role assumed via STS) and that credential was authorized by explicit policy.

2. **Least-privilege is not 'remove the admin'.** It's modeling what each identity *needs to do* and nothing more. A service that reads S3 objects doesn't need `s3:*` — it needs `s3:GetObject` on a specific ARN. The difference between the two is the difference between a contained incident and a full exfiltration.

3. **Auditing is not compliance — it's detection.** CloudTrail logging everything with nobody reading it is not zero-trust, it's security theater. The value is in closing the loop: log → analysis → alert → automated response.

These three principles translate directly into the six steps of this playbook. Order matters: identity and least-privilege first because they have the highest return per hour invested. Encryption and auditing amplify what you've already built.
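
The three shifts can be illustrated with a toy authorization function. This is a minimal sketch, not how IAM evaluates policies: the `Request` type, the `POLICY` set, the role name, and the bucket ARN are all hypothetical. The point is only that network location is recorded but never grants access, that anything not explicitly allowed is denied, and that every decision is logged.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Request:
    principal: Optional[str]  # verified identity (e.g. an assumed role); None if unauthenticated
    action: str
    resource: str
    source_subnet: str        # network location: logged, never used to grant access


# Explicit allow list of (principal, action, resource); anything absent is denied.
POLICY = {
    ("orders-service-role", "s3:GetObject", "arn:aws:s3:::orders-invoices/*"),
}

AUDIT_LOG = []


def authorize(request: Request) -> bool:
    """Allow only when a verified identity matches an explicit policy entry; always log."""
    allowed = (
        request.principal is not None
        and (request.principal, request.action, request.resource) in POLICY
    )
    AUDIT_LOG.append({
        "principal": request.principal,
        "action": request.action,
        "resource": request.resource,
        "subnet": request.source_subnet,
        "allowed": allowed,
    })
    return allowed


if __name__ == "__main__":
    bucket = "arn:aws:s3:::orders-invoices/*"
    # Inside the VPC but without identity: denied.
    print(authorize(Request(None, "s3:GetObject", bucket, "10.0.1.0/24")))
    # Verified identity with an explicit allow: permitted.
    print(authorize(Request("orders-service-role", "s3:GetObject", bucket, "10.0.1.0/24")))
```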

## Traditional Perimeter vs. Zero-Trust on AWS

| Dimension | Traditional Perimeter | Zero-Trust on AWS |
| --- | --- | --- |
| Trust model | Trusts the network (VPC/subnet) | Trusts verified identity per request |
| Access credential | Static access key (IAM user key) | Ephemeral credential via STS AssumeRole / OIDC |
| Blast radius of compromise | High: free lateral movement inside the VPC | Low: limited to the scope of the compromised role |
| Internal service access control | Security Group + NACL (IP/port based) | IAM policy + VPC endpoint policy + SG per service |
| Internal network traffic | Often unencrypted (trusts the perimeter) | TLS mandatory even inside the VPC; KMS for data at rest |
| Auditing | Network flow logs (VPC Flow Logs): who talked to whom | CloudTrail API-level: who did what, with which identity, in what context |
| Anomaly detection | Reactive: detects after lateral movement | Proactive: GuardDuty detects compromised credential before damage |
| Initial implementation cost | Low (existing network configuration) | Medium: requires IAM refactoring and identity pipelines |

## Why step order matters: return on effort

Engineers often want to start with what's visible — network diagrams, segmentation, firewalls. But in zero-trust on AWS, the highest return per hour invested is in the first two steps: strong identity and IAM least-privilege.

The reasoning is direct: if you eliminate static keys and ensure each identity has only what it needs, you've reduced the blast radius of any future compromise before you even know it's going to happen. Microsegmentation without strong identity is a sandcastle: you've limited network movement but not logical movement via a stolen credential.

The sequence I recommend has dependency logic:

- **Steps 1 and 2 (Identity + Least-Privilege)** are independent and have immediate impact. Can be done in parallel by two people in 1–3 days.
- **Step 3 (Adaptive Access)** depends on established strong identity — you can't add session context without knowing who the session is.
- **Step 4 (Microsegmentation)** depends on least-privilege — per-service Security Groups only make sense if identities are already well-defined.
- **Step 5 (Encryption)** can be done in parallel with 3 and 4, but depends on well-governed KMS keys (which depend on the IAM from step 2).
- **Step 6 (Continuous Auditing)** closes the loop — it's only useful if the other steps are in place, because logs need identity context to be actionable.

This sequence is not dogma — it's risk optimization per sprint. If you only have one week, do steps 1 and 2. If you have two, add 4 and 6. The important thing is not to start from the end.
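
The dependency logic above can be written down as a small graph and sorted into waves of work that can run in parallel. This is a sketch using Python's standard `graphlib`; the `DEPENDS_ON` mapping is one reading of the bullets above (step 6 is modeled as depending on all other steps), and `waves` is a hypothetical helper name.

```python
from graphlib import TopologicalSorter

# Step number -> set of steps it depends on, as read from the list above.
DEPENDS_ON = {
    1: set(),            # Strong identity
    2: set(),            # IAM least-privilege
    3: {1},              # Adaptive access needs strong identity
    4: {2},              # Microsegmentation needs least-privilege
    5: {2},              # Encryption needs well-governed KMS keys (IAM)
    6: {1, 2, 3, 4, 5},  # Continuous auditing closes the loop
}


def waves(depends_on):
    """Group steps into waves; all steps in a wave can be worked on in parallel."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        result.append(ready)
        sorter.done(*ready)
    return result


if __name__ == "__main__":
    print(waves(DEPENDS_ON))
```

Under these assumptions the graph resolves into three waves: steps 1 and 2 first, then 3, 4, and 5 together, then 6.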

## The 6 Steps: Zero-Trust on AWS in a Sprint

1. **Step 1 — Strong Identity: Federation, OIDC, and Ephemeral Credentials** — **Goal:** Eliminate static access keys (IAM user access keys) from workloads and pipelines. Every non-human identity must assume roles via STS; every human identity must use federation.

**Concrete actions:**
1. Enable **IAM Identity Center** (SSO) with your IdP (Okta, Azure AD, Google Workspace). Configure mandatory MFA for all users.
2. For EKS workloads: configure **IRSA (IAM Roles for Service Accounts)** — each pod assumes a specific role via OIDC, with no credential in the environment.
3. For EC2/ECS workloads: use **Instance Profile / Task Role** — never put `AWS_ACCESS_KEY_ID` in an environment variable or config file.
4. For CI/CD pipelines (GitHub Actions, GitLab): configure **OIDC federation** — the pipeline assumes an IAM role via OIDC token, with no stored secret.
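
For the CI/CD case, the role's trust policy is where the OIDC federation is actually enforced. The sketch below builds such a trust policy for GitHub Actions as a plain dictionary. The account ID, organization, repository, and branch are placeholders, and the function name is hypothetical; the condition keys follow the GitHub OIDC provider's `aud` and `sub` claims.

```python
import json

GITHUB_OIDC_HOST = "token.actions.githubusercontent.com"


def github_oidc_trust_policy(account_id, org, repo, branch="main"):
    """Build a trust policy that lets one repo/branch assume the role via OIDC."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC_HOST}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    # Audience must be STS; subject pins the exact repo and branch.
                    f"{GITHUB_OIDC_HOST}:aud": "sts.amazonaws.com",
                    f"{GITHUB_OIDC_HOST}:sub": f"repo:{org}/{repo}:ref:refs/heads/{branch}",
                }
            },
        }],
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    policy = github_oidc_trust_policy("111122223333", "example-org", "payments-api")
    print(json.dumps(policy, indent=2))
```

Pinning `sub` to a specific repository and branch matters: a trust policy that only checks the audience would let any GitHub repository assume the role.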

2. **Step 2 — IAM Least-Privilege: Per Function, No Wildcard** — **Goal:** Each IAM role authorizes exactly what the function needs — no more, no less. No production policy with `Action: '*'` or `Resource: '*'` without documented justification.

**Concrete actions:**
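1. Use **IAM Access Analyzer** to identify external and unused access: `aws accessanalyzer create-analyzer --analyzer-name <name> --type ACCOUNT` generates external access findings, and a second analyzer created with `--type ACCOUNT_UNUSED_ACCESS` generates unused access findings.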
2. For existing roles: use **IAM Access Advisor** to see which services were actually accessed in the last 90 days — remove the rest.
3. Implement **permission boundaries** for roles created by automation pipelines (CDK, Terraform) — cap the permission ceiling even if the code tries to create a more permissive role.
4. Use **conditions** in policies to restrict by context: `aws:RequestedRegion`, `aws:PrincipalTag`, `s3:prefix`.
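
The 'no wildcard' rule is easy to check mechanically before a policy ever reaches production. The following is a minimal lint sketch over a policy document loaded as a dictionary. The function names are hypothetical, and it only inspects `Allow` statements with `Action` and `Resource` keys; it does not handle `NotAction` or `NotResource`.

```python
def _as_list(value):
    """IAM allows a single string or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_wildcards(policy):
    """Return human-readable findings for wildcard actions or resources in Allow statements."""
    findings = []
    for index, statement in enumerate(_as_list(policy.get("Statement"))):
        if statement.get("Effect") != "Allow":
            continue
        for action in _as_list(statement.get("Action")):
            if action == "*" or action.endswith(":*"):
                findings.append(f"Statement {index}: wildcard action {action!r}")
        for resource in _as_list(statement.get("Resource")):
            if resource == "*":
                findings.append(f"Statement {index}: wildcard resource '*'")
    return findings


if __name__ == "__main__":
    too_broad = {"Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}]}
    scoped = {"Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": "arn:aws:s3:::orders-invoices/*",
    }]}
    print(find_wildcards(too_broad))
    print(find_wildcards(scoped))
```

A check like this fits naturally in a pull-request pipeline, with an allow-list file for the documented exceptions.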

3. **Step 3 — Adaptive Access: Session Context and Metadata in Policy** — **Goal:** The access policy considers not just 'who' but 'from where', 'when', and 'with what context' — and can deny even a valid identity if the context is anomalous.

**Concrete actions:**
1. Use **session tags** in STS AssumeRole to propagate context: `aws sts assume-role --tags Key=Environment,Value=prod Key=Team,Value=payments`. Policies can then use `aws:PrincipalTag/Environment` as a condition.
2. Configure **IAM condition keys** for context: `aws:MultiFactorAuthPresent` (require MFA for destructive operations), `aws:SourceIp` (restrict console access to corporate IPs), `aws:RequestedRegion`.
3. For sensitive human access (production console, delete operations): require MFA with `Condition: { "Bool": { "aws:MultiFactorAuthPresent": "true" } }`.
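
Combining the MFA condition with a session tag gives a statement that only applies when both the identity and its context check out. The sketch below builds such a statement and includes a toy matcher to show that condition operators in one statement are ANDed. The helper names, the bucket ARN, and the matcher itself are illustrative assumptions; real evaluation happens inside IAM.

```python
import json


def mfa_and_tag_scoped_statement(actions, resource_arn, environment):
    """Allow the actions only with MFA present and a matching Environment session tag."""
    return {
        "Effect": "Allow",
        "Action": list(actions),
        "Resource": resource_arn,
        "Condition": {
            "Bool": {"aws:MultiFactorAuthPresent": "true"},
            "StringEquals": {"aws:PrincipalTag/Environment": environment},
        },
    }


def context_satisfies(statement, context):
    """Toy check: every condition key must equal the context value (operators are ANDed)."""
    for expected in statement["Condition"].values():
        for key, value in expected.items():
            if context.get(key) != value:
                return False
    return True


if __name__ == "__main__":
    statement = mfa_and_tag_scoped_statement(
        ["s3:DeleteObject"], "arn:aws:s3:::orders-invoices/*", "prod"
    )
    print(json.dumps(statement, indent=2))
    # Valid identity and tag, but no MFA in the session: the statement does not apply.
    print(context_satisfies(statement, {
        "aws:MultiFactorAuthPresent": "false",
        "aws:PrincipalTag/Environment": "prod",
    }))
```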

4. **Step 4 — Microsegmentation: Per-Service Security Groups and PrivateLink** — **Goal:** Eliminate the flat network where any resource in the VPC can reach any other. Each service has its own network perimeter, and traffic to AWS services never goes to the public internet.

**Concrete actions:**
1. **Security Group per service, not per tier:** Instead of a `backend-sg` covering all microservices, create `payments-service-sg`, `orders-service-sg` etc. Ingress rules reference the source SG, not a CIDR.
2. **Eliminate open egress rules:** The SG default allows all egress — restrict explicitly. A service that only calls RDS doesn't need egress to the internet.
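3. **VPC Endpoints for AWS services:** S3, DynamoDB, SQS, SNS, KMS, Secrets Manager: all should be accessed via private endpoints. Most services use interface endpoints (PrivateLink); S3 and DynamoDB also support gateway endpoints, which are route-table based rather than PrivateLink. Either way, traffic never passes through the internet gateway.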
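
The first two actions can be audited from the output of `describe-security-groups`. The sketch below assumes the dictionary shape returned by that API (`IpPermissions`, `IpPermissionsEgress`, `IpRanges`, `UserIdGroupPairs`); the function name, group names, and IDs are hypothetical, and IPv6 ranges and prefix lists are ignored for brevity.

```python
def audit_security_group(group):
    """Flag CIDR-based ingress and open egress in one security group description."""
    name = group["GroupName"]
    findings = []
    for rule in group.get("IpPermissions", []):
        for ip_range in rule.get("IpRanges", []):
            findings.append(
                f"{name}: ingress from CIDR {ip_range['CidrIp']} (reference a source SG instead)"
            )
    for rule in group.get("IpPermissionsEgress", []):
        for ip_range in rule.get("IpRanges", []):
            if ip_range["CidrIp"] == "0.0.0.0/0":
                findings.append(f"{name}: open egress to 0.0.0.0/0")
    return findings


if __name__ == "__main__":
    flat = {
        "GroupName": "backend-sg",
        "IpPermissions": [{"IpProtocol": "tcp", "FromPort": 8080, "ToPort": 8080,
                           "IpRanges": [{"CidrIp": "10.0.0.0/16"}], "UserIdGroupPairs": []}],
        "IpPermissionsEgress": [{"IpProtocol": "-1",
                                 "IpRanges": [{"CidrIp": "0.0.0.0/0"}], "UserIdGroupPairs": []}],
    }
    per_service = {
        "GroupName": "payments-service-sg",
        "IpPermissions": [{"IpProtocol": "tcp", "FromPort": 8443, "ToPort": 8443,
                           "IpRanges": [], "UserIdGroupPairs": [{"GroupId": "sg-0123456789abcdef0"}]}],
        "IpPermissionsEgress": [{"IpProtocol": "tcp", "FromPort": 5432, "ToPort": 5432,
                                 "IpRanges": [], "UserIdGroupPairs": [{"GroupId": "sg-0fedcba9876543210"}]}],
    }
    for group in (flat, per_service):
        print(group["GroupName"], audit_security_group(group))
```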

5. **Step 5 — End-to-End Encryption: Internal TLS and KMS** — **Goal:** Data in transit and at rest is encrypted regardless of the network — because the network is not trusted by definition.

**Concrete actions:**
1. **Mandatory TLS even inside the VPC:** Configure internal ALBs with ACM certificates. For service-to-service communication (service mesh), use AWS App Mesh with mutual TLS (mTLS) or Envoy with managed certificates.
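2. **KMS for data at rest:** Enable encryption at rest with CMKs (Customer Managed Keys) on S3, RDS, DynamoDB, EBS, Secrets Manager. Prefer CMKs over AWS-managed keys for sensitive data: with a CMK you own the key policy, so you can restrict which identities may use the key, audit its usage via CloudTrail, and disable the key or revoke access when needed.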
3. **Secrets Manager instead of environment variables:** Never put secrets in env vars or SSM parameters without encryption. Use Secrets Manager with automatic rotation and access via IAM role.
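
For S3, both halves of this step (TLS in transit, KMS at rest) can be enforced with a bucket policy rather than relying on every client to behave. The sketch below builds one such policy. The bucket name and function name are placeholders; the condition keys `aws:SecureTransport` and `s3:x-amz-server-side-encryption` are standard, but whether the second statement suits a given bucket is an assumption to validate.

```python
import json


def encryption_guardrail_policy(bucket):
    """Bucket policy that denies non-TLS access and uploads not requesting SSE-KMS."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {
                "Sid": "DenyUploadsWithoutKms",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"{bucket_arn}/*",
                "Condition": {
                    "StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}
                },
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(encryption_guardrail_policy("orders-invoices"), indent=2))
```

One caveat on the second statement: uploads that omit the encryption header are denied too, even when bucket default encryption is configured, so it is best treated as optional when default encryption with a CMK is already in place.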

6. **Step 6 — Continuous Auditing: CloudTrail + Automated Detection** — **Goal:** Close the loop — every action with identity is logged, analyzed, and if anomalous, generates an alert or automated response before it becomes an incident.

**Concrete actions:**
1. **Multi-region, multi-account CloudTrail:** Enable an organization trail via AWS Organizations — covers all accounts and regions automatically. Send to a centralized S3 bucket in a dedicated security account with MFA Delete enabled.
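2. **CloudTrail Insights:** Enable it to detect anomalies in API call rate and error rate, which can surface a compromised credential doing reconnaissance (e.g., 1000 `DescribeInstances` calls in 5 minutes). Call-rate Insights are computed from write management events, so a purely read-only burst like that one is more likely to show up through error-rate Insights (when calls are denied) or through GuardDuty.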
3. **GuardDuty enabled in all accounts:** Analyzes CloudTrail, VPC Flow Logs, and DNS logs.
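
To make the reconnaissance example concrete, here is a toy sliding-window detector over CloudTrail-shaped records. It is a sketch of the idea only, not how CloudTrail Insights or GuardDuty work internally. It assumes each record carries the standard `eventTime`, `eventName`, and `userIdentity.arn` fields; the function name, default threshold, and sample ARNs are hypothetical.

```python
from collections import deque
from datetime import datetime, timedelta


def find_bursts(events, threshold=1000, window=timedelta(minutes=5)):
    """Return (principal ARN, event name) pairs that reach `threshold` calls within `window`."""
    recent = {}      # (arn, eventName) -> timestamps still inside the window
    flagged = set()
    for event in sorted(events, key=lambda e: e["eventTime"]):
        key = (event["userIdentity"]["arn"], event["eventName"])
        when = datetime.fromisoformat(event["eventTime"].replace("Z", "+00:00"))
        times = recent.setdefault(key, deque())
        times.append(when)
        # Drop calls that fell out of the window ending at the current event.
        while when - times[0] > window:
            times.popleft()
        if len(times) >= threshold:
            flagged.add(key)
    return flagged


if __name__ == "__main__":
    def record(arn, minute):
        return {"eventTime": f"2025-03-08T12:{minute:02d}:00Z",
                "eventName": "DescribeInstances",
                "userIdentity": {"arn": arn}}

    noisy = "arn:aws:sts::111122223333:assumed-role/ci-role/session"
    quiet = "arn:aws:sts::111122223333:assumed-role/orders-role/session"
    events = [record(noisy, m) for m in (0, 1, 2)] + [record(quiet, m) for m in (0, 10, 20)]
    # A low threshold is used here only to keep the demo small.
    print(find_bursts(events, threshold=3))
```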

## Zero-Trust Architecture on AWS: Identity → Adaptive Policy → Segmented Service → Auditing

Complete flow of a request in a zero-trust environment: from identity authentication (human or workload) to execution in the segmented service and the auditable record. Each layer is an independent control — compromising one does not compromise the others.

### 👤 Identity Layer

- **Human User** (`human`): IdP + MFA
- **Workload** (`workload`): EKS Pod / Lambda
- **CI/CD Pipeline** (`cicd`): GitHub Actions

### 🔑 Auth & Token Layer

- **IAM Identity Center** (`idc`): SSO federation
- **OIDC Provider** (`oidc`): IRSA / GitHub OIDC
- **AWS STS** (`sts`): AssumeRole, ephemeral credentials

### 📋 Policy & Context Layer

- **IAM Policy** (`iam`): least-privilege + conditions
- **SCP** (`scp`): org guardrails
- **ABAC** (`abac`): session tags + resource tags

### 🔒 Network & Service Layer

- **Security Groups** (`sg`): per-service, no flat network
- **PrivateLink** (`pl`): VPC endpoints, no internet path
- **Target Service** (`svc`): S3 / RDS / SQS + KMS encrypted
- **AWS KMS** (`kms`): CMK, encrypt at rest

### 🔍 Audit & Detection Layer

- **CloudTrail** (`ct`): all regions, org trail
- **GuardDuty** (`gd`): anomaly detection, all accounts
- **Security Hub** (`sh`): unified findings, CIS + FSBP
- **EventBridge + Lambda** (`eb`): auto-remediation

### Flows

- human -> idc: SAML/OIDC + MFA
- workload -> oidc: OIDC token (IRSA)
- cicd -> oidc: OIDC token
- idc -> sts: AssumeRole
- oidc -> sts: AssumeRoleWithWebIdentity
- sts -> iam: ephemeral credential
- iam -> scp: guardrail check
- iam -> abac: session tags
- iam -> sg: authorized → accesses network
- sg -> pl: private traffic
- pl -> svc: endpoint policy check
- svc -> kms: encrypt/decrypt
- sts -> ct: API call logged
- svc -> ct: data access logged
- ct -> gd: continuous analysis
- gd -> sh: findings
- sh -> eb: HIGH severity
- eb -> sts: revoke session
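
The last two flows (high-severity finding to EventBridge, then session revocation) can be sketched as an event pattern plus the policy a remediation Lambda would attach. Both are assumptions about one reasonable wiring, not the only one: the pattern matches Security Hub imported findings, `CRITICAL` is included alongside `HIGH` as an extra, and the revocation uses a deny on `aws:TokenIssueTime`, which invalidates sessions issued before a cutoff. The function name is hypothetical.

```python
import json
from datetime import datetime, timezone

# EventBridge pattern for Security Hub findings labeled HIGH (CRITICAL added as an assumption).
HIGH_SEVERITY_PATTERN = {
    "source": ["aws.securityhub"],
    "detail-type": ["Security Hub Findings - Imported"],
    "detail": {"findings": {"Severity": {"Label": ["HIGH", "CRITICAL"]}}},
}


def revoke_sessions_policy(now=None):
    """Inline policy denying everything to sessions whose credentials predate `now`."""
    cutoff = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            "Condition": {"DateLessThan": {"aws:TokenIssueTime": cutoff}},
        }],
    }


if __name__ == "__main__":
    print(json.dumps(HIGH_SEVERITY_PATTERN, indent=2))
    # A remediation function would attach this to the compromised role as an inline policy.
    print(json.dumps(revoke_sessions_policy(), indent=2))
```

Sessions issued after the cutoff are unaffected, so legitimate workloads recover as soon as they assume the role again, while stolen credentials stop working.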

## What zero-trust doesn't solve — and where you still need other layers

Zero-trust is an access control philosophy, not a complete security solution. There are classes of problems it doesn't directly cover:

**Application vulnerabilities:** Zero-trust authenticates and authorizes the request, but if the application code has a SQL injection or SSRF, the authenticated request can be used to exploit the vulnerability. WAF (AWS WAF), static code analysis (SAST), and penetration testing are complementary, not replaced.

**Sophisticated insider threat:** A legitimate user with legitimate access who slowly exfiltrates data within the scope of their role is hard to detect with zero-trust alone. This is where DLP (Data Loss Prevention), behavioral analysis (Amazon Macie for sensitive data in S3), and periodic access reviews come in.

**Supply chain attacks:** If a third-party dependency in your code is compromised, it runs with your service's identity and permissions. Zero-trust doesn't solve this — you need dependency analysis (Amazon Inspector, Dependabot), verified base images, and SBOM.

**Real operational cost:** Implementing zero-trust correctly increases operational complexity. Debugging connectivity issues becomes harder when there are multiple control layers (IAM policy + SCP + endpoint policy + SG). You need good observability and clear runbooks to avoid trading security risk for availability risk.

The practical conclusion: zero-trust is necessary but not sufficient. It drastically reduces blast radius and eliminates common attack vectors — but it needs to be part of a defense-in-depth strategy, not the complete strategy.

> **Anti-Patterns: Where Zero-Trust Bites in Production**

**1. 'Being in the VPC' as implicit credential**
The most common anti-pattern: Security Groups allowing traffic from any instance in the same subnet, without identity verification. An attacker who compromises any instance in the VPC inherits that access. Fix: SG per service with reference to source SG, not CIDR.

**2. IAM with `Action: '*'` or `Resource: '*'` in production**
Wildcard policies are the equivalent of giving the datacenter key to every developer. They frequently arise from 'it was just for testing' that was never reverted. Fix: IAM Access Analyzer + AWS Config managed rules such as `iam-policy-no-statements-with-admin-access` and `iam-no-inline-policy-check` + quarterly policy review.

**3. Static access keys in environment variables or code**
Still the most frequent vector for AWS account compromise. Keys in `.env`, in Git repositories, in Docker images. Fix: IRSA/Task Role for workloads, OIDC for CI/CD, Secrets Manager for application secrets. Monitor with `git-secrets` and GuardDuty.

**4. CloudTrail enabled but nobody reading it**
Security theater. Logs without analysis are just storage cost. Fix: GuardDuty + Security Hub + at least one EventBridge rule for automated response.

**5. Implementing all 6 steps at once without testing**
Badly implemented zero-trust can take down production services (overly restrictive SG, endpoint policy blocking legitimate access). Fix: implement in staging environments first, use `aws iam simulate-principal-policy` to test policies before applying, and enable CloudTrail before any change to have an auditable rollback.
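
Testing policies before applying them can be automated as a comparison between expected decisions and what the simulator returns. The sketch below checks a result list in the shape returned by `simulate-principal-policy` (`EvalActionName`, `EvalDecision`). The helper name is hypothetical, and the sample results are hand-written for illustration, not real simulator output.

```python
def check_decisions(evaluation_results, expected):
    """Compare simulator results with expected decisions; return a list of mismatches."""
    actual = {r["EvalActionName"]: r["EvalDecision"] for r in evaluation_results}
    mismatches = []
    for action, want in expected.items():
        got = actual.get(action, "missing")
        if got != want:
            mismatches.append(f"{action}: expected {want}, got {got}")
    return mismatches


if __name__ == "__main__":
    # In a pipeline, the results would come from a call such as:
    #   boto3.client("iam").simulate_principal_policy(
    #       PolicySourceArn=role_arn, ActionNames=[...], ResourceArns=[...]
    #   )["EvaluationResults"]
    sample_results = [
        {"EvalActionName": "s3:GetObject", "EvalDecision": "allowed"},
        {"EvalActionName": "s3:DeleteObject", "EvalDecision": "allowed"},
    ]
    expected = {"s3:GetObject": "allowed", "s3:DeleteObject": "implicitDeny"}
    print(check_decisions(sample_results, expected))
```

Failing the build on any mismatch catches both kinds of regression: a policy that became too permissive and one that would break a legitimate call.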

> **Rule of Thumb:** **Start with identity + least-privilege. Always.**

If you can only do two steps this week, do steps 1 and 2: eliminate static keys and remove IAM wildcards. These two steps reduce the blast radius of any future compromise — before you even know it's going to happen. Microsegmentation without strong identity is decoration. Auditing without least-privilege is noise. Identity + least-privilege is the foundation — everything else amplifies what you've already built.

> **My Perspective: What I Actually Do in Practice:** After 16 years building financial systems and data platforms, the pattern that cost me the most to learn is this: **security is not a feature you add at the end — it is an emergent property of architectural decisions you make from the first sprint**.

In practice, when I join a new project or do an architecture review, the first questions I ask are not about performance or cost — they are: 'Who is this identity and how does it prove who it is?' and 'What can this role do that it shouldn't be able to?'. These two questions reveal more about a system's security posture than any pentest.

What I concretely do:
- **I never approve a PR that adds a static access key** in code, environment variable, or CI/CD secret without OIDC as a documented alternative. A static key is security technical debt with compound interest.
- **I use `aws iam simulate-principal-policy` as part of the CI pipeline** to test policies before deployment — it's like unit tests for IAM.
- **I enable GuardDuty and Security Hub on day zero** of any new account, even if it's a sandbox. The cost is minimal; the value of having a detection baseline from the start is enormous.
- **I treat PrivateLink endpoint policies as important as IAM policies** — many people configure the endpoint but leave the policy as `Allow *`, which nullifies the control.
- **I do quarterly IAM Access Advisor reviews** to identify unused permissions. Unused permission is attack surface you're paying to maintain.
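
To show what a non-default endpoint policy can look like, here is a sketch of an S3 VPC endpoint policy that replaces `Allow *` with a scoped statement. The bucket name, the organization ID, and the function name are placeholders; `aws:PrincipalOrgID` is a standard global condition key, and the chosen actions are an assumption about what the service needs.

```python
import json


def s3_endpoint_policy(bucket, org_id):
    """Endpoint policy: only principals from one organization, only one bucket, only get/put."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": "*",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": f"arn:aws:s3:::{bucket}/*",
            "Condition": {"StringEquals": {"aws:PrincipalOrgID": org_id}},
        }],
    }


if __name__ == "__main__":
    print(json.dumps(s3_endpoint_policy("orders-invoices", "o-exampleorgid"), indent=2))
```

With this in place, credentials from outside the organization cannot use the endpoint to reach other buckets, which closes a common exfiltration path.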

What I *don't* do: I don't implement everything at once in production without staging. I've seen badly implemented zero-trust take down critical services due to overly restrictive SGs. The sequence matters, and testing before applying is not optional.

## Zero-Trust and the AWS Well-Architected Framework

- **security**: Zero-trust is the direct implementation of the Well-Architected security pillar principles: strong identity (SEC 2), least-privilege (SEC 3), data protection (SEC 8/9), and detection (SEC 4). The 6 steps map directly to security pillar best practices.
- **reliability**: Microsegmentation and least-privilege reduce failure blast radius — a compromised service cannot cascade to others. However, overly restrictive SGs can create connectivity failure points: test in staging before production.
- **performance**: Internal TLS and PrivateLink have negligible latency in modern workloads. KMS adds ~1-5ms per envelope encryption operation — acceptable for the vast majority of use cases.

## Verdict

Zero-trust on AWS is not a six-month transformation project — it is a sequence of six architectural decisions you can make in a sprint. Order matters: strong identity and least-privilege first, because they reduce blast radius before any compromise happens. Microsegmentation and encryption amplify what you've already built. Auditing closes the loop.

What separates a real zero-trust environment from security theater is simple: **every request proves who it is, is authorized by explicit policy, and is logged in a way that lets you detect anomaly before it becomes an incident**. If you can't answer 'which identity made that API call and was it authorized to do so?', you're still in the perimeter model — regardless of how many firewalls you have.

Start today. Steps 1 and 2. The rest comes in the next sprint.

## References

- [AWS — Zero Trust on AWS](https://aws.amazon.com/security/zero-trust/)
- [AWS — IAM Best Practices](https://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html)
- [AWS — AWS PrivateLink](https://aws.amazon.com/privatelink/)
- [AWS — AWS CloudTrail](https://aws.amazon.com/cloudtrail/)
- [AWS — IAM Access Analyzer](https://docs.aws.amazon.com/IAM/latest/UserGuide/what-is-access-analyzer.html)
- [AWS — Amazon GuardDuty](https://aws.amazon.com/guardduty/)
- [AWS — AWS Security Hub](https://aws.amazon.com/security-hub/)
- [AWS — IAM Roles for Service Accounts (IRSA)](https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html)
