Design Doc: Zero Trust on AWS for Internal Service Access
This document proposes a Zero Trust architecture on AWS where identity, context, and device posture replace the network perimeter as the primary access control mechanism. The design covers workload segmentation, adaptive access via IAM Identity Center and Verified Access, and continuous audit instrumentation. The goal is to eliminate implicit trust based on network location without introducing excessive operational friction.
Network perimeter is no longer a security guarantee. This RFC defines how to redesign access to internal services on AWS using identity as the new perimeter — with least privilege, granular segmentation, and real-time context-based access decisions.
The Problem: Implicit Trust as an Attack Surface
Most enterprise architectures on AWS still operate with a network-location-based trust model: if a resource is inside the VPC or connected via VPN, it receives implicit access to other internal services. This model was reasonable when network boundaries were well-defined and static. Today, it is architectural debt.
The problem manifests in concrete ways. Engineers with SSH access to a bastion host inside the VPC can reach production databases unrelated to their role. Services with overprovisioned IAM roles, created during a rushed sprint, accumulate permissions that are never revoked. Workloads in different accounts communicate via peering without any traffic inspection or identity validation. When an attacker compromises any point inside this perimeter, lateral movement is trivial.
The scenario this document addresses is representative of organizations at scale: multiple AWS accounts organized via AWS Organizations, engineering teams with federated access via corporate SSO, mixed workloads (ECS, Lambda, EC2), and a growing set of internal APIs that need to be accessible to developers, CI/CD pipelines, and other services — but not to everyone, not always, and not without traceability.
The solution is not to add more VPNs or more Security Group rules. It is to rethink the trust model from scratch: no entity — human or machine — receives access by default. All access is explicitly authorized, minimally privileged, contextually validated, and continuously audited. This is Zero Trust applied pragmatically, not as a buzzword.
Goals and Non-Goals
Scenario Context
- Document type: Design Doc / RFC (representative scenario)
- Domain: Cloud infrastructure security (Zero Trust)
- Assumed scale: 50–500 engineers, 10–50 AWS accounts, dozens of internal services
- Primary stack: AWS IAM Identity Center, Verified Access, VPC Lattice, CloudTrail, Security Hub, GuardDuty, AWS Organizations
- Identity model: Federation via external IdP (Okta / Azure AD) + IAM Roles for workloads
- Relevant standards: Aligned with NIST SP 800-207 (Zero Trust Architecture) and AWS Well-Architected Security Pillar
- RFC status: Proposed
Proposed Design: Identity as the Perimeter
The design is organized into four complementary layers that, together, implement Zero Trust principles without requiring a complete rewrite of existing infrastructure.
Layer 1 — Centralized and Federated Identity. AWS IAM Identity Center becomes the single entry point for human access to all AWS accounts. Federation with the corporate IdP (Okta or Azure AD) ensures that multi-factor authentication, password policies, and identity lifecycle are managed in the company's system of record. Permission Sets in Identity Center are mapped to IdP groups, and the principle of least privilege is applied by role: a backend engineer does not receive, by default, access to production data accounts. Elevated access (e.g., break-glass for production) is granted via an approval flow with a maximum TTL of 4 hours and automatic notification to the security team.
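The elevated-access rule above (approval plus a maximum TTL of 4 hours) can be sketched as a small data model. This is a minimal illustration, not how Identity Center implements it: the `ElevatedGrant` class, its field names, and the rule that a requester cannot approve their own grant are assumptions introduced here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

MAX_TTL = timedelta(hours=4)  # maximum TTL stated in the design


@dataclass(frozen=True)
class ElevatedGrant:
    """Hypothetical record of one approved break-glass request."""
    user: str
    permission_set: str
    approved_by: str
    granted_at: datetime
    ttl: timedelta

    def __post_init__(self):
        # Reject grants that exceed the 4-hour ceiling.
        if self.ttl <= timedelta(0) or self.ttl > MAX_TTL:
            raise ValueError("TTL must be positive and at most 4 hours")
        # Assumed rule: approval must come from a second person.
        if self.approved_by == self.user:
            raise ValueError("requester cannot approve their own grant")

    def expires_at(self) -> datetime:
        return self.granted_at + self.ttl

    def is_active(self, now: datetime) -> bool:
        # The grant is valid from issue time up to, but not including, expiry.
        return self.granted_at <= now < self.expires_at()
```

A scheduler or the approval tool would call `is_active` before honoring the grant and remove the Permission Set assignment once it returns False.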
Layer 2 — Adaptive Access for Developers. AWS Verified Access replaces VPN for developer access to internal tools (dashboards, admin APIs, observability consoles). Each access request is evaluated in real time against three signals: identity verified by the IdP (with MFA), device posture via integration with the existing MDM/EDR agent, and session context (time, location, anomalies detected by GuardDuty). There is no persistent VPN session that, once established, grants broad access. Each request is a new authorization decision.
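The per-request decision described above can be modeled as a deny-by-default function over the three signals. Verified Access expresses its policies in Cedar, so the Python below only illustrates the decision logic; the `AccessRequest` fields and the reason strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    """Signals attached to one request; field names are illustrative."""
    user_groups: frozenset
    mfa_verified: bool
    device_managed: bool
    device_compliant: bool
    active_threat_finding: bool  # e.g. an open GuardDuty finding for this user


def authorize(request: AccessRequest, required_group: str) -> tuple:
    """Return (allowed, reason). Deny by default; every signal must pass."""
    if required_group not in request.user_groups:
        return False, "identity: not in required group"
    if not request.mfa_verified:
        return False, "identity: MFA missing"
    if not (request.device_managed and request.device_compliant):
        return False, "device: posture check failed"
    if request.active_threat_finding:
        return False, "context: active threat finding"
    return True, "all signals passed"
```

Because the function holds no session state, the same user can be allowed on one request and denied on the next if a signal changes, which is the property that distinguishes this model from a persistent VPN session.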
Layer 3 — Inter-Service Communication with VPC Lattice. For service-to-service communication, VPC Lattice provides a unified control plane that applies workload identity-based authorization policies (IAM roles) independent of network topology. A payments service in one account can call a notifications service in another account only if the caller's IAM role is explicitly authorized in the destination service's resource policy. There are no open Security Group rules between VPCs. Traffic is authenticated by SigV4 signature on each call. Services that do not yet support VPC Lattice use VPC endpoints with restrictive resource policies as a transition layer.
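As an illustration of the payments-to-notifications example, the sketch below builds an auth policy document that allows only named caller roles to invoke a Lattice service. The helper name, the account ID, and the role name are placeholders; the `vpc-lattice-svcs:Invoke` action is the one VPC Lattice auth policies use for request authorization, but the exact policy a team needs should be checked against the service documentation.

```python
import json


def lattice_auth_policy(caller_role_arns):
    """Build an auth policy that allows only the listed IAM roles to invoke."""
    if not caller_role_arns:
        # An empty allow-list is almost certainly a mistake; fail loudly.
        raise ValueError("at least one caller role is required")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": sorted(caller_role_arns)},
                "Action": "vpc-lattice-svcs:Invoke",
                "Resource": "*",
            }
        ],
    }


# Placeholder ARN: the payments service role in a hypothetical account.
policy = lattice_auth_policy(
    ["arn:aws:iam::111122223333:role/payments-service"]
)
policy_json = json.dumps(policy, indent=2)
```

Any caller whose role is not listed gets no matching Allow statement, so the request is denied without any Security Group rule being involved.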
Layer 4 — Continuous Audit and Detection. CloudTrail runs with Data Events enabled across all accounts, and its logs are centralized in a dedicated security account via AWS Organizations. Security Hub aggregates findings from GuardDuty, IAM Access Analyzer, and Config Rules into a unified dashboard. IAM Access Analyzer runs continuously to detect excessive permissions and unintentional external access. A set of custom Config Rules verifies security invariants: no IAM role with *:* permissions, no Security Group with 0.0.0.0/0 on sensitive ports, no public S3 buckets in production accounts. Critical findings trigger automation via EventBridge for automatic revocation of suspicious sessions.
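Two of the invariants above can be expressed as pure checks over policy and rule data. This is a sketch of the evaluation logic a custom Config Rule might run, not the Config Rule API itself; the set of sensitive ports and the shape of the ingress-rule dictionaries are assumptions.

```python
SENSITIVE_PORTS = {22, 3306, 3389, 5432}  # assumed list; adjust per organization


def _as_list(value):
    """IAM policy fields may be a single string or a list of strings."""
    return value if isinstance(value, list) else [value]


def has_full_admin(policy_doc) -> bool:
    """True if any Allow statement grants every action on every resource."""
    for stmt in _as_list(policy_doc.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action", []))
        resources = _as_list(stmt.get("Resource", []))
        if any(a in ("*", "*:*") for a in actions) and "*" in resources:
            return True
    return False


def open_sensitive_ports(ingress_rules):
    """Return the sensitive ports reachable from 0.0.0.0/0, sorted."""
    exposed = set()
    for rule in ingress_rules:
        if rule["cidr"] != "0.0.0.0/0":
            continue
        exposed |= {
            port for port in SENSITIVE_PORTS
            if rule["from_port"] <= port <= rule["to_port"]
        }
    return sorted(exposed)
```

A resource is compliant when `has_full_admin` returns False and `open_sensitive_ports` returns an empty list.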
Zero Trust Architecture: Access Flow and Controls
Diagram shows the three access planes (human, service-to-service, audit) and how each is controlled by identity and context, not network location.
- Corporate IdP · Okta / Azure AD
- IAM Identity Center · SSO + Permission Sets
- AWS Verified Access · Per-Request Access
- Developer · + Managed Device
- SCPs · AWS Organizations
- IAM Access Analyzer · Continuous Detection
- AWS Config · Config Rules
- GuardDuty · Threat Detection
- Service A · ECS / Lambda (Prod Account)
- Service B · ECS / Lambda (Data Account)
- VPC Lattice · Service Network
- VPC Endpoints · Resource Policies
- RDS / Aurora · IAM Auth Enabled
- CloudTrail · Data Events (Org)
- Security Hub · Aggregated Findings
- EventBridge · Response Automation
- External SIEM · S3 + Kinesis
Key Design Decisions and Trade-offs
VPC Lattice vs. Service Mesh (Istio/App Mesh). The choice of VPC Lattice for inter-service communication is deliberate and deserves justification. A service mesh like Istio offers more granular traffic control (circuit breaking, retry policies, traffic splitting), but introduces a complex control plane that most platform teams cannot operate safely. VPC Lattice delegates operational complexity to AWS and integrates natively with IAM — meaning service authorization policies are expressed in the same language as all other AWS policies, are audited by CloudTrail, and are analyzed by IAM Access Analyzer. The trade-off is less traffic flexibility in exchange for much less operational surface area. For most Zero Trust security use cases, this is the correct trade-off.
Verified Access vs. Corporate VPN. VPNs create persistent sessions with broad network access. Once connected, a developer (or an attacker who compromised their credentials) has network reach to all resources in the VPC. Verified Access evaluates each request individually, meaning a compromised session does not automatically grant access to all resources. The cost is additional per-request latency (typically 5–15ms, estimate based on service characteristics) and the need for device posture agent integration. For low-frequency internal tools (dashboards, admin consoles), this cost is acceptable. For high-frequency internal APIs between services, VPC Lattice is the correct mechanism — not Verified Access.
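IAM Database Authentication vs. Static Credentials. Enabling IAM Auth on RDS/Aurora removes the need for applications to hold database passwords. Authentication uses a temporary token, valid for 15 minutes, that the client generates by signing a request with its IAM credentials (SigV4); generating the token does not require a call to STS. The token is needed only when a connection is opened and does not affect sessions that are already established. The trade-off is that every new database connection must present a valid token and go through IAM authentication, which has implications for connection pooling and for workloads that open connections at a high rate. The solution is to use a proxy (RDS Proxy) that maintains the connection pool, so clients authenticate to the proxy with IAM while the number of new connections to the database stays low. This pattern removes long-lived database passwords from application code and environment variables, which eliminates that class of credential leak.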
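Because a token is valid for 15 minutes, a client that opens connections often can reuse one token instead of generating a new one each time. The sketch below caches a token and refreshes it shortly before expiry. The `AuthTokenCache` class and the 60-second refresh margin are assumptions; `generate_token` is an injected callable that, in a real client, would wrap something like the boto3 RDS client's `generate_db_auth_token`.

```python
import time

TOKEN_LIFETIME_SECONDS = 15 * 60   # token validity stated in the design
REFRESH_MARGIN_SECONDS = 60        # assumed safety margin before expiry


class AuthTokenCache:
    """Reuse one auth token until it is close to expiring."""

    def __init__(self, generate_token, clock=time.monotonic):
        self._generate = generate_token  # callable returning a fresh token
        self._clock = clock              # injectable for testing
        self._token = None
        self._issued_at = None

    def get(self):
        now = self._clock()
        max_age = TOKEN_LIFETIME_SECONDS - REFRESH_MARGIN_SECONDS
        stale = self._token is None or now - self._issued_at >= max_age
        if stale:
            self._token = self._generate()
            self._issued_at = now
        return self._token
```

A connection pool would call `get()` only when it opens a new connection; connections that are already established keep working after the token that opened them has expired.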
SCPs as Guardrails vs. IAM Policies as Controls. Service Control Policies at the Organizations level are guardrails that define the maximum possible — they do not grant access, only limit what can be granted. IAM Policies are the actual controls. The correct combination is: SCPs restrict unauthorized regions, unapproved services, and high-risk actions (e.g., disabling CloudTrail, removing MFA from root) for the entire organization; IAM Policies implement specific least privilege per workload. Trying to do everything via SCPs results in large, brittle policies. Trying to do everything via IAM without SCPs leaves gaps that an account administrator can exploit.
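The guardrail-versus-control split can be made concrete with a small model: an SCP that only denies, and an effective-permission check that requires both an IAM allow and the absence of an SCP deny. The approved regions, the `Sid` values, and the toy evaluator are assumptions; a production region guardrail usually also exempts global services, which this sketch omits.

```python
APPROVED_REGIONS = ["eu-west-1", "us-east-1"]  # assumed allow-list

# Guardrail SCP: it grants nothing, it only limits what IAM can grant.
GUARDRAIL_SCP = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ProtectAuditTrail",
            "Effect": "Deny",
            "Action": ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"],
            "Resource": "*",
        },
        {
            "Sid": "DenyUnapprovedRegions",
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            "Condition": {
                "StringNotEquals": {"aws:RequestedRegion": APPROVED_REGIONS}
            },
        },
    ],
}


def scp_denies(action: str, region: str) -> bool:
    """Toy evaluation of the two statements above (not a full policy engine)."""
    protected = GUARDRAIL_SCP["Statement"][0]["Action"]
    return action in protected or region not in APPROVED_REGIONS


def effective_allow(action: str, region: str, iam_allowed_actions: set) -> bool:
    """Access requires an IAM allow AND no SCP deny."""
    return action in iam_allowed_actions and not scp_denies(action, region)
```

With this model, an account administrator whose IAM policy allows `cloudtrail:StopLogging` is still blocked, while a workload with no IAM allow gets nothing even though the SCP does not deny its action.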
Alternatives for Developer Access to Internal Tools
Corporate VPN (current model)
- Familiar to operations teams
- No additional per-request latency after connection established
- Broad network access after authentication — trivial lateral movement
- No per-request device posture evaluation
- Persistent session not immediately revoked on compromise
- Granular access logs difficult to correlate with specific actions
Rejected for new access. Kept only as emergency fallback with additional MFA.
AWS Verified Access
- Per-request identity + device posture evaluation
- No persistent network session — zero implicit lateral movement
- Native integration with IAM Identity Center and external IdPs
- Detailed per-request access logs (Verified Access logs), with configuration changes recorded in CloudTrail
- Additional per-request latency (estimate: 5–15ms)
- Requires integrated device posture agent
- Cost per active endpoint hour
Selected. Best security/operation ratio for human access to internal tools.
Bastion Host with Session Manager
- No exposed SSH port — access via AWS Systems Manager
- Sessions audited in CloudTrail
- Low cost
- Still grants network access from bastion after authentication
- No context evaluation or device posture
- Does not scale for access to internal web applications
Accepted only for emergency access to specific EC2 instances, not as a general solution.
Decision: VPC Lattice as Inter-Service Control Plane
Services in multiple AWS accounts need to communicate in an authenticated and authorized manner. Options are: VPC peering with Security Groups (current model), team-managed service mesh (Istio/App Mesh), or AWS-managed VPC Lattice.
Adopt VPC Lattice as the control plane for inter-service communication. IAM role-based authorization policies. SigV4 for per-request authentication. Gradual migration: new services use Lattice by default; existing services migrate by business domain.
- Eliminates Security Group rules between VPCs for service communication
- All inter-service communication can be audited: requests via VPC Lattice access logs, policy and configuration changes via CloudTrail
- Teams need to learn to write resource policies for VPC Lattice services
- Additional cost per request processed in Lattice (evaluate impact for high-volume services)
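To evaluate the cost impact for a high-volume service, a rough monthly estimate can be computed from request rate and payload size. This is a back-of-the-envelope sketch: it assumes a per-request plus per-GB pricing shape, takes both unit prices as parameters because no prices are given here, and ignores any fixed hourly charges.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month for estimation


def estimate_monthly_cost(
    requests_per_second: float,
    avg_kb_per_request: float,
    price_per_million_requests: float,
    price_per_gb: float,
) -> float:
    """Estimate monthly request + data processing cost for one service."""
    requests = requests_per_second * SECONDS_PER_MONTH
    gigabytes = requests * avg_kb_per_request / (1024 * 1024)  # KB -> GB
    request_cost = requests / 1_000_000 * price_per_million_requests
    data_cost = gigabytes * price_per_gb
    return request_cost + data_cost
```

Running the same function with the unit prices of the alternative (for example VPC endpoints) gives a like-for-like comparison before a service is scheduled for migration.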
Phased Rollout Plan
Phase 0 — Foundation (Weeks 1–4)
Enable CloudTrail with Data Events across all accounts via Organizations. Activate GuardDuty and Security Hub with aggregation in the security account. Run IAM Access Analyzer across all accounts and document existing findings (baseline). Implement guardrail SCPs: block unauthorized regions, require MFA for sensitive actions, prevent disabling audit services. No access changes yet — visibility only.
Phase 1 — Centralized Identity (Weeks 5–8)
Configure IAM Identity Center with federation to the corporate IdP. Map existing IdP groups to least-privilege Permission Sets. Migrate human access to SSO — eliminate IAM users with long-lived credentials for people. Implement elevated access flow with approval and 4-hour TTL. Communicate changes to engineering teams with documentation and 2-week transition period.
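Before cutting a team over to SSO, it helps to diff the access each person has today against what the new group mappings would give them. The sketch below is one way to do that; the function name and the input shape (principal mapped to a collection of account names) are assumptions.

```python
def access_regressions(current: dict, planned: dict) -> dict:
    """Map each principal to the accounts they reach today but would lose."""
    lost = {}
    for principal, accounts in current.items():
        missing = set(accounts) - set(planned.get(principal, ()))
        if missing:
            lost[principal] = sorted(missing)
    return lost


# Hypothetical example data.
current_access = {
    "alice": ["prod", "staging"],
    "bob": ["staging"],
}
planned_access = {
    "alice": ["staging"],
    "bob": ["staging", "dev"],
}
report = access_regressions(current_access, planned_access)
```

Not every entry in the report is an error, since some removals are the intended result of least privilege, but each one should be confirmed as deliberate before the transition period ends.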
Phase 2 — Adaptive Access (Weeks 9–14)
Deploy AWS Verified Access for priority internal tools (observability dashboards, admin consoles). Integrate with existing MDM/EDR agent for device posture evaluation. Define access policies per tool: identity + MFA + managed device. Decommission VPN access for migrated tools. Keep VPN only as emergency fallback with additional MFA and automatic alerting.
Phase 3 — Service Segmentation (Weeks 15–24)
Enable VPC Lattice in the shared network account. Migrate new services to use Lattice by default. Migrate existing services by business domain, starting with lowest operational risk. Enable IAM Database Authentication on RDS/Aurora with RDS Proxy. Remove inter-VPC Security Group rules as services migrate. Validate with IAM Access Analyzer that no residual permissions remain.
Phase 4 — Automation and Maturity (Weeks 25–32)
Implement response automation via EventBridge: automatic revocation of sessions with high risk score in GuardDuty, quarantine of IAM roles with anomalous behavior. Implement automated periodic permission review (Access Reviews) with report to team managers. Integrate Security Hub findings with corporate SIEM. Establish quarterly review process for SCPs and Permission Sets. Document incident response runbooks for Zero Trust scenarios.
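The revocation step can be sketched in two parts: an EventBridge event pattern that matches high-severity GuardDuty findings, and the deny policy that invalidates role sessions issued before a cutoff time. The severity threshold of 7 and the helper name are assumptions; the `aws:TokenIssueTime` condition is the mechanism IAM uses when sessions for a role are revoked.

```python
from datetime import datetime, timezone

# Event pattern: GuardDuty findings with severity >= 7 (assumed threshold).
HIGH_SEVERITY_PATTERN = {
    "source": ["aws.guardduty"],
    "detail-type": ["GuardDuty Finding"],
    "detail": {"severity": [{"numeric": [">=", 7]}]},
}


def revoke_sessions_policy(cutoff: datetime) -> dict:
    """Inline deny policy that blocks role sessions issued before `cutoff`."""
    if cutoff.tzinfo is None:
        raise ValueError("cutoff must be timezone-aware")
    stamp = cutoff.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                "Condition": {
                    "DateLessThan": {"aws:TokenIssueTime": stamp}
                },
            }
        ],
    }
```

The automation target (for example a Lambda function) would attach the returned document to the affected role as an inline policy; sessions created after the cutoff are unaffected, so legitimate callers recover by obtaining new credentials.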
Critical Risks and Mitigations
Risk 1: Accidental lockout during identity migration. Migrating IAM users to SSO is the highest operational risk moment. A group mapping error can leave entire teams without access to production accounts. Mitigation: maintain emergency IAM users (break-glass) with credentials stored in AWS Secrets Manager, accessible only with dual approval and automatic alerting. Test each Permission Set in staging before applying to production.
Risk 2: Performance regression with VPC Lattice. High-volume services (>10k req/s) may experience latency and cost impact with the addition of per-request SigV4 authentication. Mitigation: mandatory benchmarking before migrating critical services. For services where cost is prohibitive, evaluate VPC endpoints with resource policies as a lower-cost alternative.
Risk 3: Cultural resistance from engineering teams. Zero Trust adds real friction to the development workflow. Developers accustomed to accessing any VPC resource via SSH will resist. Mitigation: invest in tooling that makes correct access easier than incorrect access. SSO CLI wrapper, updated onboarding documentation, and security champions in each team.
Risk 4: False sense of security. Zero Trust at the infrastructure layer does not protect against application vulnerabilities, code injection, or compromise of corporate IdP credentials. If the IdP is compromised, the entire chain of trust is compromised. Mitigation: Zero Trust is a layer, not a complete solution. It must be combined with application security, regular penetration testing, and monitoring of the IdP itself.
AWS Well-Architected: Design Assessment
Security
Strong. Identity as primary perimeter, verifiable least privilege, continuous audit, automated detection. Aligned with all Security Pillar design principles: implement a strong identity foundation, enable traceability, apply security at all layers, automate security best practices.
Reliability
Moderate. VPC Lattice and Verified Access are managed services with AWS SLAs. The risk is dependency on external IdP availability for human access — mitigated with break-glass IAM. Testing IdP failure scenarios is mandatory.
Sustainability
Neutral. Elimination of bastion hosts reduces idle EC2 instances. No significant sustainability impact.
Success Metrics and Targets
- IAM users with long-lived credentials (humans): target 0 (except break-glass, monitored)
- IAM roles with *:* permissions: target 0, automatically detected by Config Rule
- CloudTrail Data Events coverage: target 100% of production accounts
- Internal services communicating via VPC Lattice or VPC Endpoint: target 100% by end of Phase 3
- Mean time to detect excessive permission (MTTD): target <24h via continuous IAM Access Analyzer
- Response time to critical GuardDuty finding: target <5 min for automatic revocation via EventBridge
- Human access to internal tools via Verified Access: target 100% of internal tools by end of Phase 2
- Full rollout duration (estimate): 32 weeks (8 months) for organization of 50–200 engineers
After working on financial systems where the cost of a breach is existential, I've learned that Zero Trust is not an architecture you implement all at once; it's a posture you adopt incrementally, starting with the highest-impact controls with the least friction. The most common mistake I see is teams trying to implement everything at once and stalling on cultural resistance.
My approach: start with visibility (Phase 0), not restriction. You need to understand the current state before changing anything. IAM Access Analyzer in discovery mode will reveal permissions nobody knew existed. That data is your political ally for justifying the changes that follow.
On VPC Lattice: I would adopt it, but with an important caveat. For existing high-volume services, I would measure the real cost before migrating. Lattice's pricing model (per request + per GB) can be surprising for chatty services. In some cases, VPC endpoints with restrictive resource policies deliver 80% of the security benefit at 10% of the cost. Don't let perfect be the enemy of good.
On Verified Access: it's genuinely good for the problem it solves, but success depends almost entirely on the quality of integration with the device posture agent. If the corporate MDM/EDR cannot report posture reliably, you'll end up with policies that don't actually evaluate devices, which is worse than admitting the limitation. Be honest about what you can actually verify.
The most important lesson: Zero Trust fails when it becomes an isolated security project. It needs to be a platform project, with product engineers and SREs
Verdict
Zero Trust on AWS is not a feature you turn on; it's a mental model shift that translates into specific architectural choices. The design proposed here is pragmatic: it uses AWS managed services (Identity Center, Verified Access, VPC Lattice) to deliver Zero Trust principles without requiring the team to operate complex security infrastructure. The trade-off is additional cost and marginal latency in exchange for elimination of implicit trust, complete traceability, and verifiable least privilege.
The five-phase rollout (Phase 0 through Phase 4) is deliberately conservative. Security that disrupts operations is not security; it's an incident. Starting with visibility before restriction, migrating by business domain, and maintaining fallback paths during transitions are practices most teams ignore in the rush to show progress.
What this design does not solve: application vulnerabilities, corporate IdP compromise, and insider threats from people with legitimate access. Zero Trust reduces the attack surface and blast radius of a compromise, but does not eliminate risk. It is a necessary layer, not a complete solution. Any architect who sells you Zero Trust as a complete solution is selling you marketing, n