ADR: Cognito Multi-Region for Resilient Authentication
This ADR examines when and how to adopt multi-region User Pool replication in Amazon Cognito to reduce authentication downtime on identity platforms with high-availability requirements. It covers regional failover, customer-managed KMS keys, user synchronization, session and token impact, custom domains, and customer experience, with explicit reasoning on operational and cost trade-offs.
Authentication is the front gate of any system. When Amazon Cognito becomes unavailable in a region, users cannot log in — and no other part of the architecture matters. This ADR documents the decision to adopt multi-region User Pool replication for an identity platform with a 99.99% SLA, detailing the forces that made the solution necessary, the options evaluated, and the real consequences of the choice.
Scenario Context
- System: Centralized identity platform (composite scenario)
- Domain: Identity / Resilience
- Scale: ~2 million active users, peak of 8,000 logins/min
- Primary / secondary regions: us-east-1 (primary), us-west-2 (secondary)
- Contractual authentication SLA: 99.99% (≈ 52.6 min downtime/year)
- ADR trigger: 47-min incident in us-east-1 caused full login blockage; SLA nearly breached
- Identity stack: Amazon Cognito User Pools, Lambda Triggers, KMS CMK, Route 53, ACM, API Gateway
- Authentication model: OAuth 2.0 / OIDC, JWT tokens (ID + Access + Refresh), custom domain auth.example.com
- Replication feature (GA): Amazon Cognito multi-Region replication, available from 2024
Context and Forces at Play
For years, Amazon Cognito operated as a regional service with no native mechanism for replicating User Pools across regions. The available resilience strategy was essentially manual: maintain a second User Pool in another region with users pre-provisioned via periodic export/import, manage failover through DNS, and accept that passwords could not be replicated (Cognito stores only per-user salted hashes, which are not exportable). This created a structural inconsistency window: any user who was created, or who changed their password, after the last sync would fail to authenticate in the secondary region.
The incident that triggered this ADR lasted 47 minutes in us-east-1. During that period, 100% of login flows failed. Token refresh flows that depended on online validation also failed for clients without valid cached tokens. The impact was amplified because the platform serves both B2C consumers and B2B operators — for the latter, login unavailability blocked critical business operations, not just convenience.
The forces shaping the decision were:
- Contractual SLA: a 99.99% target mathematically cannot tolerate regional incidents without automatic failover.
- Nature of identity data: passwords, MFA seeds, and user attributes are highly sensitive and require strict control over where and how they are replicated.
- Session complexity: JWT tokens issued by the primary region need to be validatable in the secondary without forced reissuance.
- Custom domains: auth.example.com resolves to a CloudFront distribution managed by Cognito, and DNS failover must be coordinated without breaking clients' OAuth PKCE/redirect_uri flows.
- Cost and operational complexity: maintaining two manually synchronized User Pools is brittle and expensive in engineering effort.
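The SLA arithmetic behind the first force can be checked with a short sketch. It assumes a 365-day year and treats the 47-minute incident as the only downtime in the period; the function names are illustrative and do not come from any tool used in the scenario.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_budget_minutes(sla_percent: float,
                            period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Minutes of downtime allowed per period at the given availability target."""
    return period_minutes * (1 - sla_percent / 100)


def budget_consumed(incident_minutes: float, sla_percent: float) -> float:
    """Fraction of the annual downtime budget used by a single incident."""
    return incident_minutes / downtime_budget_minutes(sla_percent)


if __name__ == "__main__":
    budget = downtime_budget_minutes(99.99)
    used = budget_consumed(47, 99.99)
    print(f"Annual budget: {budget:.2f} min, incident used {used:.0%}")
```

At 99.99% the annual budget is roughly 52.6 minutes, so a single 47-minute incident uses about 89% of it. A second regional event of similar size in the same year would breach the SLA, which is why automatic failover is treated as a requirement and not an optimization.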
The Problem with KMS Keys and Tokens in Multi-Region Scenarios
One aspect frequently underestimated in discussions about Cognito multi-region is the role of KMS keys. When you configure a User Pool with a customer-managed key (CMK) to encrypt sensitive attributes or for advanced token signing, that key is regional by default. If the secondary User Pool uses a different CMK — or if the primary key is not accessible in the secondary region — any data encrypted with the primary key becomes unreadable at failover.
The correct solution is to use KMS Multi-Region Keys (MRKs), available since 2021. An MRK allows the same logical key (same key ID) to exist in multiple AWS regions, with key material synchronized by AWS. This means data encrypted in us-east-1 with the MRK can be decrypted in us-west-2 with the replica of the same MRK — without key material export, maintaining the HSM security model. For this scenario, we configured MRKs in both regions and associated each User Pool with the corresponding regional replica.
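A small sketch of what "same key ID in multiple regions" means at the ARN level. It relies on two documented properties of MRKs: their key IDs start with `mrk-`, and related multi-Region keys share the key ID while differing in the region segment of the ARN. The account number and key ID in the usage note are placeholders, and the helper names are hypothetical.

```python
def is_multi_region_key(key_arn: str) -> bool:
    """True if the ARN's resource part looks like a KMS multi-Region key."""
    parts = key_arn.split(":", 5)
    return len(parts) == 6 and parts[2] == "kms" and parts[5].startswith("key/mrk-")


def replica_arn(primary_arn: str, replica_region: str) -> str:
    """Derive the ARN a replica of this MRK would have in another region.

    Only the region segment changes; the key ID stays the same.
    This does not check that the replica actually exists.
    """
    if not is_multi_region_key(primary_arn):
        raise ValueError("not a multi-Region KMS key ARN: " + primary_arn)
    parts = primary_arn.split(":", 5)
    parts[3] = replica_region
    return ":".join(parts)
```

For example, `replica_arn("arn:aws:kms:us-east-1:111122223333:key/mrk-1234abcd12ab34cd56ef1234567890ab", "us-west-2")` yields the same ARN with `us-west-2` in the region position. A check like this can be useful in IaC validation to confirm that each User Pool references the replica in its own region and not the primary's ARN.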
Regarding JWT tokens: tokens issued by Cognito are signed with RSA keys managed internally by the service (exposed via the JWKS endpoint). With Cognito's native multi-region replication, the secondary User Pool shares the same issuer (iss claim) and the same signing keys as the primary — which is fundamental. Without this, a token issued by the primary User Pool would be rejected by any validator fetching the JWKS from the secondary, because the keys would differ. Native replication resolves this problem structurally. In manual architectures (two independent User Pools), this problem has no clean solution: either you force re-login at failover, or you implement a validation proxy that tries both JWKS endpoints — both options carry significant user experience cost or operational complexity.
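To make the issuer/JWKS dependency concrete, here is a minimal sketch of the first step a validator performs: reading the unverified `iss` claim to decide which JWKS document to fetch. It uses the standard Cognito issuer format (`https://cognito-idp.<region>.amazonaws.com/<userPoolId>`) and JWKS path (`/.well-known/jwks.json`). The pool IDs in the usage note are placeholders, the function names are hypothetical, and signature verification itself is deliberately out of scope.

```python
import base64
import json


def _b64url_decode(segment: str) -> bytes:
    """Decode a base64url JWT segment, restoring the stripped padding."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def unverified_header_and_claims(token: str) -> tuple:
    """Parse header and payload WITHOUT verifying the signature."""
    segments = token.split(".")
    if len(segments) != 3:
        raise ValueError("a JWT must have exactly three segments")
    header = json.loads(_b64url_decode(segments[0]))
    claims = json.loads(_b64url_decode(segments[1]))
    return header, claims


def jwks_url(issuer: str) -> str:
    """JWKS location for a Cognito User Pool issuer."""
    return issuer.rstrip("/") + "/.well-known/jwks.json"


def select_jwks_url(token: str, trusted_issuers: set) -> str:
    """Pick the JWKS to fetch based on the token's iss claim.

    Rejects tokens whose issuer is not explicitly trusted. The caller
    must still verify the signature against the key matching `kid`.
    """
    _, claims = unverified_header_and_claims(token)
    issuer = claims.get("iss")
    if issuer not in trusted_issuers:
        raise ValueError("untrusted issuer: " + repr(issuer))
    return jwks_url(issuer)
```

With two independent User Pools, `trusted_issuers` has to contain both issuers (for example `https://cognito-idp.us-east-1.amazonaws.com/us-east-1_EXAMPLE` and its us-west-2 counterpart), and every downstream validator carries that dual configuration. With a shared issuer and signing keys, as this ADR describes for native replication, the set collapses to one entry and validators need no failover logic.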
Options Evaluated
Option A: Accept regional risk (status quo)
- Pro: Zero additional infrastructure cost
- Pro: No synchronization operational complexity
- Pro: Simple custom domain configuration
- Con: 99.99% SLA mathematically unachievable without failover
- Con: Incident already demonstrated 47-min total downtime impact
- Con: Risk of contractual breach with B2B clients
Rejected. The real incident made this option indefensible.
Option B: Two independent User Pools with manual sync via Lambda/EventBridge
- Pro: Available before the native replication feature
- Pro: Full control over what is synchronized
- Con: Passwords cannot be replicated (per-user salted hashes)
- Con: Tokens issued by primary are invalid in secondary (different JWKS)
- Con: Structural inconsistency window for newly created users
- Con: High operational complexity: triggers, DLQs, reconciliation, dual monitoring
- Con: Failover forces re-login or complex validation proxy
Rejected. High complexity for incomplete resilience: the worst of both worlds.
Option C: Cognito native multi-region replication + MRK + Route 53 ARC
- Pro: Full replication of users, attributes, credentials, and JWT signing keys
- Pro: Tokens issued by primary are valid in secondary (same issuer and JWKS)
- Pro: MRKs ensure encryption continuity without key material export
- Pro: DNS failover automatable via Route 53 ARC with health checks
- Pro: Reduces operational complexity vs. manual synchronization
- Con: Additional cost: replication and dual User Pool execution (estimate: +30-40% on Cognito cost)
- Con: Custom domain requires independent configuration on each User Pool, with ACM certificates held in us-east-1
- Con: Lambda Triggers are not replicated by the service and must be deployed and kept in sync across regions
- Con: Active sessions at failover time may require re-authentication if refresh token is not honored
- Con: Relatively new feature; edge case behavior still being documented by the community
Accepted. Best balance between real resilience, security, and manageable operational complexity.
Option D: Replace Cognito with another multi-region identity solution (self-hosted Keycloak, or managed Auth0/Okta)
- Pro: Full control over replication, failover, and data model
- Pro: Provider portability
- Con: Very high migration cost: 2M users, existing OAuth integrations, Lambda Triggers
- Con: Self-hosted Keycloak requires own infrastructure, patching, HA; real TCO much higher
- Con: Auth0/Okta: per-MAU cost prohibitive at 2M users
- Con: Does not solve the immediate problem; minimum 12-18 month timeline
Rejected for this ADR's horizon. Can be reassessed in a future strategic review.
Formal Decision
The identity platform operates with a 99.99% SLA and serves 2 million active users. A 47-minute regional incident in us-east-1 nearly breached the contractual SLA and blocked critical B2B client operations. The existing architecture had no authentication failover mechanism. AWS released native multi-region replication for Cognito User Pools, making a structural solution viable.
Adopt Amazon Cognito User Pool native multi-region replication between us-east-1 (primary) and us-west-2 (secondary), with KMS Multi-Region Keys for encryption continuity, Route 53 Application Recovery Controller (ARC) for automated DNS failover, and a custom domain configured independently on each User Pool, with ACM certificates held in us-east-1 as Cognito custom domains require. Lambda Triggers will be managed via IaC (Terraform) with simultaneous deployment in both regions. Target RTO is 5 minutes; RPO is near-zero for user data (continuous replication).
- POSITIVE: JWT tokens issued in the primary region are valid in the secondary without forced re-authentication, as the issuer and signing keys are shared.
- POSITIVE: User data (attributes, credentials, groups) is continuously replicated, eliminating the inconsistency window of the manual approach.
- POSITIVE: MRKs ensure data encrypted with the primary key is decryptable in the secondary region without key material export.
- NEGATIVE: Sessions with a refresh-token exchange in flight at the exact moment of failover may require re-authentication; this is estimated to affect fewer than 1% of active sessions.
- NEGATIVE: Lambda Triggers must be kept in sync across regions via CI/CD pipeline; configuration drift is an operational risk requiring explicit monitoring.
- NEGATIVE: Estimated additional cost of 30-40% on the Cognito line of the identity budget (estimate based on public pricing model; validate with the AWS Pricing Calculator for actual scale).
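The trigger-drift risk listed above lends itself to an automated parity check. The sketch below is one possible shape for it, not the platform's actual monitoring: it compares two mappings from trigger name to a code hash. It assumes the hashes were collected beforehand, for instance from the `CodeSha256` field that Lambda reports for each function; how they are fetched is left out so the logic stays testable offline. All names are hypothetical.

```python
def trigger_drift(primary: dict, secondary: dict) -> dict:
    """Compare {trigger_name: code_hash} maps from two regions.

    Returns the triggers missing from the secondary, the ones that exist
    only in the secondary, and the ones whose code hash differs.
    """
    primary_names, secondary_names = set(primary), set(secondary)
    return {
        "missing_in_secondary": sorted(primary_names - secondary_names),
        "extra_in_secondary": sorted(secondary_names - primary_names),
        "code_mismatch": sorted(
            name for name in primary_names & secondary_names
            if primary[name] != secondary[name]
        ),
    }


def in_parity(primary: dict, secondary: dict) -> bool:
    """True only when both regions hold the same triggers with the same code."""
    return not any(trigger_drift(primary, secondary).values())
```

Running a check like this on a schedule, and alarming when `in_parity` is false, turns drift from a surprise at failover time into a routine alert.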
Custom Domains, Sessions, and Customer Experience at Failover
The custom domain is one of the most delicate aspects of Cognito failover. When you configure auth.example.com as a User Pool's custom domain, Cognito provisions a service-managed CloudFront distribution and associates your ACM certificate with it. That certificate must be in us-east-1 regardless of where the User Pool is, because CloudFront is a global service and only accepts ACM certificates from us-east-1. The secondary User Pool in us-west-2 therefore also needs a certificate in us-east-1 (the same one can be reused if the domain is identical), and a second custom domain must be configured on that User Pool.
The DNS failover strategy we adopted uses Route 53 with two weighted CNAME records (weight 100/0 in normal operation) pointing to each region's CloudFront distributions. Route 53 ARC monitors health checks against the /oauth2/token endpoint of each region. When the primary health check fails for 3 consecutive 10-second periods, ARC automatically executes failover, changing the weight to 0/100. Record TTL is set to 60 seconds — a conscious trade-off between failover speed and DNS resolution load.
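The numbers in this paragraph combine into a rough cutover estimate. The sketch below is a simplification under stated assumptions: detection takes exactly threshold times interval, the record change is instantaneous once failover triggers, and resolvers honor the TTL. Real cutover will be longer where clients or resolvers cache more aggressively. Class and function names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FailoverTiming:
    check_interval_s: int = 10   # health check period from the text
    failure_threshold: int = 3   # consecutive failures before failover
    record_ttl_s: int = 60       # TTL of the weighted records

    def detection_s(self) -> int:
        """Time for the health check to declare the primary unhealthy."""
        return self.check_interval_s * self.failure_threshold

    def worst_case_cutover_s(self) -> int:
        """Detection plus one full TTL for clients that resolved just before."""
        return self.detection_s() + self.record_ttl_s


def first_trip_index(results: list, threshold: int) -> Optional[int]:
    """Index of the check at which `threshold` consecutive failures is reached.

    `results` holds True for a passing check and False for a failing one.
    Returns None if the threshold is never reached.
    """
    streak = 0
    for index, passed in enumerate(results):
        streak = 0 if passed else streak + 1
        if streak >= threshold:
            return index
    return None
```

With the defaults, detection takes 30 seconds and the worst-case cutover under these assumptions is 90 seconds, which leaves margin against the 5-minute RTO target. `first_trip_index` shows why a single passing check in the middle of a flapping outage resets the count and delays failover.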
For customer experience, the most visible impact at failover is for users mid-authorization flow (OAuth redirect in progress). These flows are stateless from Cognito's perspective, but the state parameter and code_verifier (PKCE) are on the client — so the redirect to the callback URL works normally even after failover, as long as the redirect_uri is registered in the secondary User Pool. This requires the redirect_uris list to be kept identical in both User Pools — an operational checklist item that must be automated via IaC to prevent drift.
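The reason the PKCE part of the flow is held on the client can be seen from how the values are produced. The following is a minimal sketch of the S256 method from RFC 7636, using only the standard library; the function names are illustrative and not taken from any SDK.

```python
import base64
import hashlib
import secrets


def make_code_verifier(n_bytes: int = 32) -> str:
    """Random base64url string kept on the client; 32 bytes gives 43 characters."""
    raw = secrets.token_bytes(n_bytes)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def code_challenge_s256(code_verifier: str) -> str:
    """BASE64URL(SHA256(code_verifier)) without padding, per RFC 7636."""
    digest = hashlib.sha256(code_verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

The client sends only the challenge in the authorization request and presents the verifier later, in the token request. The verifier is never stored server-side, so the client can present it to whichever region answers the token request.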
Multi-Region Failover Architecture — Authentication Flow
The diagram shows the normal flow (us-east-1 active) and the failover path to us-west-2, including User Pool replication, MRKs, and DNS control via Route 53 ARC. Its components are:
- Clients: Browser / Mobile (OAuth 2.0 + PKCE); Application (API Gateway + Lambda)
- DNS: Route 53 weighted CNAME for auth.example.com; Route 53 ARC health check on /oauth2/token
- us-east-1 (primary): Cognito-managed CloudFront for auth.example.com; Cognito User Pool (OIDC issuer); Lambda Triggers (Pre-Token / Post-Auth); KMS MRK mrk-xxxxxxxx
- us-west-2 (replica): Cognito-managed CloudFront for auth.example.com; Cognito User Pool (same issuer + JWKS); Lambda Triggers (Pre-Token / Post-Auth); KMS MRK replica mrk-xxxxxxxx (same key ID)
- Cross-region: Cognito replication (continuous sync of users + credentials); Terraform IaC (dual-region deploy of triggers + config); CloudWatch (alarms + dashboards in both regions)
Well-Architected Assessment
Security
KMS Multi-Region Keys maintain the HSM security model without key material export. Sensitive user attributes remain encrypted at rest in both regions. Least privilege is applied: Lambda Triggers have separate IAM roles per region, without unnecessary cross-region permissions. MFA and password policies are replicated along with the User Pool.
Reliability
Target RTO of 5 minutes via Route 53 ARC, with health checks that detect a failure in about 30 seconds (three consecutive 10-second checks). Near-zero RPO for user data with native continuous replication. Quarterly Game Days validate the failover procedure. The design eliminates Cognito as a regional SPOF for the authentication flow.
Implementation Plan
Phase 1 — Foundation (Weeks 1-2)
Create KMS Multi-Region Key in us-east-1 and replicate to us-west-2. Update primary User Pool to use the MRK. Validate that existing user data remains accessible. Configure Terraform modules for dual-region management.
Phase 2 — Replication (Weeks 3-4)
Enable Cognito native multi-region replication. Wait for initial synchronization (time proportional to user volume). Validate integrity of user sample in secondary region. Replicate Lambda Triggers via CI/CD pipeline with simultaneous deployment.
Phase 3 — Domain and DNS (Week 5)
Provision ACM certificate in us-east-1 for auth.example.com (or reuse existing if domain is identical). Configure custom domain on secondary User Pool. Create weighted Route 53 records (100/0). Configure Route 53 ARC with health checks against /oauth2/token of both regions.
Phase 4 — Validation and Game Day (Week 6)
Execute controlled Game Day: simulate us-east-1 failure, measure actual RTO, validate that existing tokens are accepted in secondary, validate complete OAuth flows (authorization code + PKCE), validate MFA. Document results and adjust TTLs and health check thresholds as needed.
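"Measure actual RTO" in the Game Day step can be made mechanical with a synthetic login probe. The sketch below assumes a probe that records a timestamp and a pass/fail result at a fixed cadence; the probe itself and the names used here are hypothetical. It computes the longest observed outage from that log.

```python
def longest_outage_seconds(probes: list) -> float:
    """Longest stretch of failed probes, in seconds.

    `probes` is a time-ordered list of (unix_timestamp, succeeded) tuples.
    An outage runs from its first failed probe to the next successful one;
    an outage still open at the end of the log is measured up to the last probe.
    """
    longest = 0.0
    outage_start = None
    for timestamp, succeeded in probes:
        if not succeeded and outage_start is None:
            outage_start = timestamp
        elif succeeded and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    if outage_start is not None:
        longest = max(longest, probes[-1][0] - outage_start)
    return longest


def meets_rto(probes: list, rto_seconds: float = 300) -> bool:
    """True if no observed outage exceeded the RTO target (5 minutes by default)."""
    return longest_outage_seconds(probes) <= rto_seconds
```

The measurement is only as fine as the probe cadence: with probes every 10 seconds, the reported outage can overstate or understate the true one by up to one interval at each end. Recording the result of each Game Day this way gives a trend line for the RTO instead of a single anecdote.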
Cognito's native multi-region replication is a genuinely important addition to AWS's resilience portfolio, but it doesn't solve everything automatically, and it's easy to underestimate the operational friction points.
The point that concerns me most in real implementations is Lambda Trigger drift. Cognito replication synchronizes user data, not application logic. If you have Pre-Token Generation triggers that add custom claims, or Post-Authentication triggers that log events to a SIEM, those need to exist and function identically in both regions. In organizations with less mature deployment pipelines, I've seen cases where the secondary region ran for weeks with outdated triggers, meaning that in a real failover, issued tokens would have carried different claims than downstream services expected. This is a silent failure that only appears in production, at the worst possible moment. Automate trigger deployment as part of the same pipeline that deploys the User Pool. No exceptions.
On KMS MRKs: the decision to use MRKs is correct and necessary, but there's a security detail that needs to be explicit in the key policy. The MRK replica in us-west-2 should have a key policy that restricts use to the specific secondary User Pool, not a permissive policy allowing any principal in the account. In financial systems, auditors will question this, and the answer needs to be documented.
Finally, on the cost trade-off: the argument that a 30-40% increase in Cognito cost is justified by the SLA is correct but incomplete. The real cost also includes engineering time to maintain configuration parity between regions, quarterly Game Days, and dual monitoring. This recurring operational cost is frequently underestimated in ROI analyses. For systems with fewer than ~500k MAUs where Cognito cost is small in absolute terms, the operational overhead may dominate. For 2M MAUs with a contractual SLA, the math clearly works out, but be honest about the total cost when presenting the decision to stakeholders.
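The key-policy concern raised in the commentary can be partially automated. The sketch below is a heuristic linter over the standard policy document structure (`Statement`, `Effect`, `Principal`, `Condition`): it flags unconditional Allow statements granted to `*` or to the account root. It is an illustration and not an audit tool. The account-root statement is part of the default KMS key policy, so a hit is a prompt for review and documentation, not automatically a defect.

```python
def _aws_principals(principal) -> list:
    """Normalize the Principal element to a list of AWS principal strings."""
    if principal == "*":
        return ["*"]
    if isinstance(principal, dict):
        value = principal.get("AWS", [])
        return [value] if isinstance(value, str) else list(value)
    return []


def _is_broad(principal: str) -> bool:
    """Wildcard, account root ARN, or a bare 12-digit account ID."""
    return (
        principal == "*"
        or principal.endswith(":root")
        or (principal.isdigit() and len(principal) == 12)
    )


def broad_allow_statements(policy: dict) -> list:
    """Sids (or positional labels) of unconditional Allow statements
    whose principal is account-wide or wildcard."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    flagged = []
    for index, statement in enumerate(statements):
        if statement.get("Effect") != "Allow" or statement.get("Condition"):
            continue
        if any(_is_broad(p) for p in _aws_principals(statement.get("Principal"))):
            flagged.append(statement.get("Sid", f"statement-{index}"))
    return flagged
```

Running this against the replica's key policy in CI gives auditors a documented, repeatable answer about which statements are account-wide and why they were accepted.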
Comparison: Before vs. After Multi-Region Architecture
| Dimension | Before (Single Region) | After (Native Multi-Region) |
|---|---|---|
| Authentication RTO | ~47 min (observed in incident) | ~5 min (target, via Route 53 ARC) |
| User data RPO | Hours (last manual export) | Near-zero (continuous replication) |
| Token validity at failover | Invalid (different JWKS) | Valid (same issuer and JWKS) |
| Credential consistency | Partial (passwords not replicable) | Complete (native replication includes credentials) |
| Operational complexity | Low (single region) | Medium (dual region, IaC mandatory) |
| Relative identity cost | Baseline | +30-40% estimated (Cognito + MRK + ARC) |
Verdict
Amazon Cognito's native multi-region replication solves a problem that, until recently, had no clean solution on the platform. For systems with a 99.99% or higher authentication SLA, with user volumes that make migration to alternatives prohibitive, and with compliance requirements demanding control over identity data replication, this is the correct architecture, and this ADR documents why.
Three points define the success or failure of this architecture:
- MRKs correctly configured with restrictive per-region key policies. Without this, failover may fail to decrypt data or, worse, fail silently.
- Lambda Triggers kept in parity by the CI/CD pipeline. Token logic drift is a failure that only surfaces in production under pressure.
- Quarterly Game Days with real failover. A 5-minute RTO is a target, not a guarantee, until validated under real conditions.
What this ADR does not resolve: the complexity of custom domains with multiple environments (staging, prod, sandbox) sharing the same Cognito; in those cases, the proliferation of replicated User Pools can become a governance problem. It also does not resolve the long-term strategic question of dependency on a single managed identity provider. Those are decisions for future ADRs, but they should be on the radar of any architect responsible for identity platforms at scale.