Cognito Multi-Region: Migrating Identity to High Availability

Resilient multi-Region authentication as a critical service
Authentication is critical infrastructure — a regional Cognito failure brings down the entire user journey. With Cognito multi-Region replication now available, there is a concrete path to elevating the identity plane to the same resilience level we already demand from databases and queues. In this article, I document the migration journey, the architecture decisions, and the risks that need active management.
For years, we accepted a silent asymmetry in our architectures: databases with multi-Region replication, queues with cross-Region replication, but the identity plane — the Cognito User Pool — running in a single Region with no automated failover. When Amazon Cognito multi-Region replication was announced, my first reaction was not excitement but relief: we can finally close that gap without building a brittle homegrown solution. The journey to get there, however, is more surgical than it appears.
The Starting Point: Identity as a Single Point of Failure
In financial systems, the Cognito User Pool carries far more than credentials. It stores custom attributes that map to risk profiles, groups that control downstream authorization scopes, and Lambda triggers that execute business logic — CPF validation, claim enrichment with compliance data, audit logging for regulatory requirements. When that pool fails in a Region, the impact is not just "users can't log in": the entire authorization chain breaks, APIs protected by Cognito Authorizer in API Gateway return 401, and onboarding flows get stuck in inconsistent states.
The pattern I saw frequently before native replication was a combination of two independent pools with eventual synchronization via EventBridge and Lambda, what we called internally a "Cognito Bridge". This pattern has at least three blind spots: race conditions during password synchronization (Cognito does not expose password hashes, so every reset has to be captured and replayed in the other pool), custom attributes that silently diverge between regions, and Lambda triggers that have to be kept in sync manually. The operational cost was high and confidence in consistency was low. Native replication solves exactly that layer of accidental complexity.
The Migration Journey: Six Decisions in Sequence
1. Audit the Existing Pool
Before any movement, inventory everything the current pool carries: custom attributes, groups and their permissions, Lambda triggers and their VPC/IAM dependencies, app clients and their OAuth flows, and the custom domain with its associated ACM certificates. Any custom attribute left unmapped now will be a painful surprise later: custom attributes can be added to a User Pool after creation, but existing ones cannot be renamed, retyped, or removed, so replication must start from a pool with the correct schema.
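The audit can be partly mechanized. The sketch below assumes you already hold the dictionary returned by a DescribeUserPool call and only summarizes two of the items listed above (custom attributes and trigger ARNs); the sample data, account ID, and function name are hypothetical.

```python
from typing import Any


def inventory_user_pool(describe_response: dict[str, Any]) -> dict[str, list[str]]:
    """Summarize custom attributes and Lambda triggers of a User Pool."""
    pool = describe_response["UserPool"]
    # Custom attributes carry the "custom:" prefix in the pool schema.
    custom_attributes = sorted(
        attr["Name"]
        for attr in pool.get("SchemaAttributes", [])
        if attr["Name"].startswith("custom:")
    )
    # LambdaConfig maps trigger names to function ARNs; nested config
    # objects (non-string values) are skipped in this simplified view.
    triggers = sorted(
        f"{name}={arn}"
        for name, arn in pool.get("LambdaConfig", {}).items()
        if isinstance(arn, str)
    )
    return {"custom_attributes": custom_attributes, "triggers": triggers}


# Hypothetical sample shaped like a DescribeUserPool response.
sample = {
    "UserPool": {
        "SchemaAttributes": [
            {"Name": "email", "AttributeDataType": "String"},
            {"Name": "custom:cpf", "AttributeDataType": "String"},
            {"Name": "custom:risk_profile", "AttributeDataType": "String"},
        ],
        "LambdaConfig": {
            "PreTokenGeneration": "arn:aws:lambda:us-east-1:111122223333:function:enrich-claims",
        },
    }
}
report = inventory_user_pool(sample)
print(report)
```

Groups, app clients, and the custom domain would need their own calls and are left out here to keep the sketch short.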
2. Cross-Region KMS Strategy
Cognito multi-Region replication uses KMS for encryption at rest. The critical decision here is KMS Multi-Region Keys (mrk-) versus independent keys per region. I strongly recommend MRKs: they share the same cryptographic key material, replicated by KMS across regions, meaning tokens and data encrypted by the primary pool can be decrypted by the replica pool without re-encryption. With independent keys, you would need an additional re-wrapping layer that adds latency and complexity. Configure the MRK key policy to allow only the Cognito service principal (cognito-idp.amazonaws.com) with an aws:SourceAccount condition restricted to your account, never a wildcard permission.
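As an illustration of that restriction, here is a minimal sketch that builds one key policy statement as a Python dictionary. The action list and the Sid are assumptions and should be checked against the service documentation; a complete key policy would also need an administrative statement for your own account.

```python
import json


def cognito_key_policy_statement(account_id: str) -> dict:
    """Build a key policy statement scoped to Cognito acting for one account."""
    return {
        "Sid": "AllowCognitoFromThisAccountOnly",  # hypothetical label
        "Effect": "Allow",
        "Principal": {"Service": "cognito-idp.amazonaws.com"},
        # Assumed action list: explicit operations, no "kms:*" wildcard.
        "Action": ["kms:Decrypt", "kms:GenerateDataKey", "kms:DescribeKey"],
        # Inside a key policy, "*" refers to the key the policy is attached to.
        "Resource": "*",
        "Condition": {"StringEquals": {"aws:SourceAccount": account_id}},
    }


statement = cognito_key_policy_statement("111122223333")
print(json.dumps(statement, indent=2))
```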
3. Pool Replication and Schema Synchronization
With native replication enabled, Cognito automatically replicates users, groups, attributes, and credentials (password hashes included) to the secondary region. The key attention point is the initial replication state: for pools with millions of users, the initial replication can take hours. During this period, the replica pool exists but is not ready to receive authentication traffic. Monitor the CloudWatch metric UserPoolReplicationStatus and only move traffic after confirmation of complete synchronization. Configure an alarm with a threshold of 0 on ReplicationLag before any failover test.
4. Lambda Triggers: The Most Underestimated Problem
Lambda triggers associated with the primary pool are not automatically replicated — they need to exist in the secondary region with the same logic and the same dependencies. This means: Lambda functions deployed in the secondary region, resource-based policy permissions updated for the replica pool, and environment variables pointing to regional endpoints (DynamoDB, Secrets Manager, RDS Proxy) correctly configured for the secondary region. A Pre-Token-Generation trigger that enriches claims with data from a DynamoDB Global Table must point to the correct regional endpoint — not to the primary region endpoint, which would create a cross-region dependency and negate the failover benefit.
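One way to catch the cross-region dependency described above is to scan each trigger's environment variables for Region names other than the one the function runs in. This is a heuristic sketch: the variable names and values are hypothetical, and the pattern only recognizes Region identifiers of the usual us-east-1 shape.

```python
import re

# Matches Region identifiers such as us-east-1, sa-east-1, or us-gov-west-1.
REGION_PATTERN = re.compile(
    r"\b[a-z]{2}(?:-gov)?-"
    r"(?:north|south|east|west|central|northeast|northwest|southeast|southwest)"
    r"-\d\b"
)


def cross_region_references(env: dict[str, str], local_region: str) -> dict[str, set[str]]:
    """Return the env vars whose values mention a Region other than local_region."""
    findings: dict[str, set[str]] = {}
    for name, value in env.items():
        foreign = set(REGION_PATTERN.findall(value)) - {local_region}
        if foreign:
            findings[name] = foreign
    return findings


# Hypothetical environment of a trigger deployed in the secondary Region.
env = {
    "TABLE_ENDPOINT": "https://dynamodb.us-east-1.amazonaws.com",
    "SECRET_ARN": "arn:aws:secretsmanager:us-west-2:111122223333:secret:compliance",
    "LOG_LEVEL": "INFO",
}
findings = cross_region_references(env, "us-west-2")
print(findings)
```

Running a check like this in the deployment pipeline for the secondary region flags the variable that still points at the primary endpoint before a failover does.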
5. Routing with Route 53 and Health Checks
Cognito exposes distinct regional endpoints for the primary pool and the replica. The routing strategy I recommend is Route 53 with Failover Routing Policy: the primary record points to the Cognito endpoint of the main region, with an active health check monitoring https://<domain>.auth.<region>.amazoncognito.com/health. The TTL should be 60 seconds: not less, to avoid DNS flood, and not more, to ensure fast failover. For applications that use the Cognito SDK directly (not via custom domain), an abstraction layer is needed: either a regional API Gateway with a Lambda proxy that resolves the correct endpoint based on health check logic, or AWS Global Accelerator for custom endpoints.
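The failover record pair can be sketched as the change batch that Route 53's ChangeResourceRecordSets operation accepts. The record name, targets, and health check ID below are placeholders, and the choice of CNAME records is an assumption; an alias record may fit your domain setup better.

```python
from typing import Optional


def failover_change_batch(record_name: str, primary_target: str, secondary_target: str,
                          primary_health_check_id: str, ttl_seconds: int = 60) -> dict:
    """Build a PRIMARY/SECONDARY failover record pair for one DNS name."""

    def record_set(role: str, target: str, health_check_id: Optional[str] = None) -> dict:
        record = {
            "Name": record_name,
            "Type": "CNAME",
            "SetIdentifier": f"identity-{role.lower()}",
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": ttl_seconds,
            "ResourceRecords": [{"Value": target}],
        }
        if health_check_id:
            # Only the primary record is tied to the health check here.
            record["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": record}

    return {
        "Comment": "Identity plane failover records",
        "Changes": [
            record_set("PRIMARY", primary_target, primary_health_check_id),
            record_set("SECONDARY", secondary_target),
        ],
    }


# Placeholder values for illustration only.
batch = failover_change_batch(
    "auth.example.com.",
    "primary-auth.example.net.",
    "replica-auth.example.net.",
    "00000000-0000-0000-0000-000000000000",
)
print(batch["Changes"][0]["ResourceRecordSet"]["Failover"])
```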
6. Failover Testing and Operational Runbook
Untested failover is non-existent failover. The test must simulate real failure: block traffic to the primary region via Security Group or NACLs, observe the Route 53 health check fail (typically 3 consecutive checks with 10s interval = 30s detection), and measure the time to the first successful login in the secondary region. Document in the runbook: the replica pool ARN, the app client credentials in the secondary region, the procedure to promote the replica to primary if necessary, and the re-synchronization process after primary region recovery. This runbook must be executed in a quarterly GameDay — not just reviewed.
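The numbers above combine into a rough upper bound that the GameDay measurement can be compared against. This is a simplified model: it adds health check detection time to the DNS TTL and ignores client-side caching and connection reuse, which can stretch the real figure.

```python
def worst_case_failover_seconds(check_interval_s: int, failure_threshold: int,
                                dns_ttl_s: int) -> int:
    """Detection time plus the longest a resolver may keep serving the old record."""
    detection_s = check_interval_s * failure_threshold
    return detection_s + dns_ttl_s


def within_rto(measured_login_s: float, rto_s: float = 120) -> bool:
    """Compare the measured time to first successful login with the RTO target."""
    return measured_login_s <= rto_s


# 3 consecutive failed checks at a 10 s interval, plus a 60 s TTL.
bound = worst_case_failover_seconds(check_interval_s=10, failure_threshold=3, dns_ttl_s=60)
print(bound)  # 90
```

If the measured time to first login in the secondary region is well above this bound, the delay is coming from somewhere other than DNS, typically triggers or app clients.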
Target Architecture: The Resilient Identity Plane
The target architecture that emerges from this migration has three distinct layers. The first is the identity data plane: primary Cognito User Pool in us-east-1 with an active replica in us-west-2, both using the same KMS MRK, with synchronous replication of users, groups, and credentials managed by the service. The second is the trigger execution plane: Lambda functions mirrored in both regions, with access to DynamoDB Global Tables (configured with tables in both regions and automatic replication), Secrets Manager with secret replication enabled, and RDS Proxy pointing to Aurora Global Database with a promotable read replica.
The third layer — frequently neglected — is the identity observability plane. Critical metrics to monitor in both regions: SignInSuccesses, SignInThrottles, TokenRefreshSuccesses, and the custom metric IdentityPlaneRPO that measures replication lag in seconds. The latter should have an SLO of < 30 seconds RPO under normal conditions, with an alarm at 60 seconds and PagerDuty at 120 seconds. Cross-region observability is implemented via CloudWatch Cross-Account Cross-Region dashboards — a single panel that aggregates identity metrics from both regions, with automatic annotations when failover is detected.
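The three thresholds for IdentityPlaneRPO can be expressed as a small classifier, for instance inside whatever job publishes the custom metric. The state names are hypothetical, and the band between 30 and 60 seconds (SLO missed, no alarm yet) is labeled DEGRADED here as an assumption.

```python
from enum import Enum


class RpoState(Enum):
    WITHIN_SLO = "within_slo"
    DEGRADED = "degraded"  # SLO missed but below the alarm threshold
    ALARM = "alarm"
    PAGE = "page"


def classify_rpo(lag_seconds: float) -> RpoState:
    """Map replication lag in seconds to the escalation tier described above."""
    if lag_seconds >= 120:
        return RpoState.PAGE
    if lag_seconds >= 60:
        return RpoState.ALARM
    if lag_seconds >= 30:
        return RpoState.DEGRADED
    return RpoState.WITHIN_SLO


for lag in (5, 45, 60, 300):
    print(lag, classify_rpo(lag).value)
```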
Multi-Region Identity Architecture with Cognito
Active-active authentication flow with automatic failover via Route 53, Cognito identity replication, mirrored Lambda triggers, and unified observability.
- Route 53 · Failover Policy TTL=60s
- Health Check · /health endpoint
- Cognito User Pool · Primary (mrk-kms-1)
- Lambda Triggers · Pre-Token / Pre-Auth
- DynamoDB · Global Table (us-east-1)
- API Gateway · Cognito Authorizer
- Cognito User Pool · Replica (mrk-kms-1)
- Lambda Triggers · Mirrored (regional env vars)
- DynamoDB · Global Table (us-west-2)
- API Gateway · Cognito Authorizer
- KMS Multi-Region Key · mrk- shared material
- CloudWatch · Cross-Region Dashboard
JWT Tokens and the Cross-Region Validation Problem
A detail that frequently escapes multi-Region Cognito migration discussions is JWT token behavior after failover. Tokens issued by the primary pool are signed with the primary pool's RSA key — the primary pool's JWKS endpoint (https://cognito-idp.<region>.amazonaws.com/<pool-id>/.well-known/jwks.json) is the source of truth for validation. After failover to the replica, the replica pool has its own JWKS endpoint.
This creates a transition window: tokens issued before failover, still valid by expiration time (typically 1h for access tokens), need to be validated against the primary pool's JWKS — which may be unavailable precisely because we are in failover. The correct solution is to implement token validation with JWKS caching at multiple layers: the API Gateway Lambda Authorizer must cache public keys from both regions and attempt validation against both JWKS before rejecting the token. The cache should have a 10-minute TTL with asynchronous refresh — never fetch JWKS at request time, as this creates a synchronous cross-region dependency.
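A minimal sketch of that cache follows. It merges the public keys of both regions by key ID, refreshes only from a background call, and keeps the last known keys when a region is unreachable. The class name, the injected fetch functions, and the sample keys are hypothetical; signature verification itself is out of scope and would be handled by a JWT library.

```python
import time
from typing import Callable, Optional


class MultiRegionJwksCache:
    """Holds public keys from several Regions; lookups never touch the network."""

    def __init__(self, fetchers: dict[str, Callable[[], dict]],
                 ttl_seconds: float = 600,
                 clock: Callable[[], float] = time.monotonic):
        self._fetchers = fetchers  # Region -> function returning a JWKS document
        self._ttl = ttl_seconds
        self._clock = clock
        self._keys: dict[str, dict] = {}  # kid -> JWK
        self._fetched_at: dict[str, float] = {}  # Region -> last successful fetch

    def refresh(self) -> None:
        """Call from a background task or warm-up hook, not from the request path."""
        for region, fetch in self._fetchers.items():
            last = self._fetched_at.get(region)
            if last is not None and self._clock() - last < self._ttl:
                continue  # still fresh
            try:
                jwks = fetch()
            except Exception:
                continue  # Region unreachable: keep serving the keys already held
            for key in jwks.get("keys", []):
                self._keys[key["kid"]] = key
            self._fetched_at[region] = self._clock()

    def find_key(self, kid: str) -> Optional[dict]:
        """Return the key for a token's kid header, whichever Region issued it."""
        return self._keys.get(kid)


# Simulated failover: the primary JWKS endpoint goes down after the first refresh.
state = {"primary_up": True}
now = [0.0]


def fetch_primary() -> dict:
    if not state["primary_up"]:
        raise ConnectionError("primary Region unreachable")
    return {"keys": [{"kid": "primary-key", "kty": "RSA"}]}


def fetch_replica() -> dict:
    return {"keys": [{"kid": "replica-key", "kty": "RSA"}]}


cache = MultiRegionJwksCache({"us-east-1": fetch_primary, "us-west-2": fetch_replica},
                             clock=lambda: now[0])
cache.refresh()
state["primary_up"] = False
now[0] = 700.0  # past the 10-minute TTL
cache.refresh()
key_after_failover = cache.find_key("primary-key")
print(key_after_failover)
```

This sketch never evicts keys, which is what lets pre-failover tokens keep validating; a production version would also need a policy for retiring rotated keys.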
For systems that use long-lived tokens (refresh tokens with 30-day validity are common in mobile banking), the token replication strategy is even more critical. Cognito replicates refresh tokens as part of session replication — but the app client ID and client secret need to be identical in both pools (or managed via Secrets Manager with replication enabled) for the refresh flow to work in the secondary region without forcing user re-authentication.
Critical Risks That Need Active Management
1. Immutable schema: Custom attributes can be added to a User Pool after creation, but existing ones cannot be renamed, retyped, or removed. If the primary pool has legacy attributes with bad names or incorrect types, the replica inherits those problems. Fix the schema before enabling replication, which may require a full pool migration as a prerequisite.
2. App Clients not automatically replicated: App clients (client IDs and secrets) need to be manually recreated in the replica or via IaC (CloudFormation/Terraform). A missing app client in the replica means the OAuth flow silently fails after failover — no clear error, just 401.
3. Custom domain and ACM certificates: The Cognito custom domain (auth.company.com) requires an ACM certificate in us-east-1 (Cognito requirement for custom domains) even if the replica is in another region. This is a service requirement that does not change with multi-Region replication.
4. Rate limits per region: Cognito has request rate quotas per region (e.g., 120 RPS for InitiateAuth by default). In a failover scenario, 100% of authentication traffic goes to the secondary region — verify the quota is adjusted via Service Quotas before failover, not during.
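For the rate limit risk in particular, the quota increase to request can be estimated ahead of time. The sketch below assumes you know the combined peak authentication rate across regions; the 20% headroom factor and the sample traffic figure are arbitrary illustrative choices.

```python
def quota_gap_rps(total_peak_rps: float, secondary_quota_rps: float,
                  headroom: float = 1.2) -> float:
    """Extra requests per second of quota the secondary Region needs (0 if none)."""
    required = total_peak_rps * headroom
    return max(0.0, required - secondary_quota_rps)


# Hypothetical peak of 160 RPS against the default quota of 120 RPS.
gap = quota_gap_rps(total_peak_rps=160, secondary_quota_rps=120)
print(f"request at least {gap:.0f} additional RPS")
```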
Governance, Compliance, and the Identity Plane in Regulated Systems
In regulated financial institutions in Brazil — subject to BACEN, LGPD, and for open finance, Joint Resolution 1 — the identity plane has governance requirements that go beyond technical availability. Cognito stores personal data (name, email, CPF in a custom attribute) and authentication data — both classified as sensitive data under LGPD. Multi-Region replication immediately raises the question of data residency: if the replica is in us-west-2 (Oregon), data of Brazilian citizens is being replicated outside Brazil.
The architectural answer to this is not to avoid replication, but to document it properly in the Record of Processing Activities (ROPA) and ensure that security controls are equivalent in both regions. Encryption with KMS MRK satisfies the requirement that data at rest be encrypted with keys under organizational control — as long as the key policy does not allow third-party access. CloudTrail with replication to S3 in both regions (with Object Lock for immutability) ensures the audit trail required by BACEN for authentication systems.
A frequently neglected point: the Pre-Token-Generation trigger that adds claims to the JWT is, in practice, an extension of the authorization plane. If this trigger fails silently (timeout, Lambda throttle), the token is issued without the compliance claims, and downstream APIs that depend on those claims make incorrect authorization decisions. Keep access tokens short-lived (AccessTokenValidity and TokenValidityUnits on the app client) so that a bad token expires quickly, and implement mandatory claim validation in the API Gateway Lambda Authorizer as an additional line of defense.
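Mandatory claim validation in the authorizer can be sketched as follows. The claim names are hypothetical, and the function assumes the token's signature and expiry were already verified upstream; it only decides whether the decoded claims are complete enough to allow the call.

```python
# Hypothetical names for the compliance claims added by the trigger.
REQUIRED_CLAIMS = ("custom:risk_profile", "compliance_status")


def missing_claims(claims: dict, required: tuple = REQUIRED_CLAIMS) -> list:
    """List the required claims that are absent or empty."""
    return [name for name in required if claims.get(name) in (None, "")]


def authorize(claims: dict, method_arn: str) -> dict:
    """Build a Lambda authorizer response that denies tokens lacking mandatory claims."""
    effect = "Deny" if missing_claims(claims) else "Allow"
    return {
        "principalId": claims.get("sub", "unknown"),
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {"Action": "execute-api:Invoke", "Effect": effect, "Resource": method_arn}
            ],
        },
    }


arn = "arn:aws:execute-api:us-west-2:111122223333:abc123/prod/GET/accounts"
complete = {"sub": "user-1", "custom:risk_profile": "low", "compliance_status": "ok"}
stripped = {"sub": "user-1"}  # what a silently failed trigger would leave behind

print(authorize(complete, arn)["policyDocument"]["Statement"][0]["Effect"])
print(authorize(stripped, arn)["policyDocument"]["Statement"][0]["Effect"])
```

Denying by default when a claim is missing turns a silent trigger failure into a visible one, which is the behavior you want during a failover.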
Identity Resilience Strategies: Comparison
| Criterion | Cognito Bridge (Homegrown) | Native Multi-Region Replication | Independent Pool + Manual Sync |
|---|---|---|---|
| RTO | 5-15 min (EventBridge propagation) | < 2 min (Route 53 health check) | 30-60 min (manual process) |
| RPO | Minutes (eventual lag) | < 30s (synchronous replication) | Hours (last manual sync) |
| Operational Complexity | High (4+ Lambdas, DLQ, reconciliation) | Low (managed by service) | Very High (extensive runbooks) |
| Password Consistency | Race condition on resets | Guaranteed (hash replicated) | Not guaranteed |
| Additional Cost | ~$200-500/month (sync infra) | ~$50-150/month (replica pool + KMS MRK) | Engineering cost dominant |
Alignment with AWS Well-Architected Framework
Security
KMS Multi-Region Keys with policy restricted by aws:SourceAccount and aws:SourceArn. No identity data in transit without TLS 1.2+. CloudTrail enabled in both regions with S3 Object Lock for audit immutability. Lambda triggers with least-privilege IAM roles and VPC endpoints for DynamoDB and Secrets Manager access.
Reliability
Multi-Region replication eliminates the identity plane SPOF. RTO < 2 min and RPO < 30s meet SLAs for critical financial systems (Tier-1). Active health checks with Route 53 Failover Policy ensure automatic detection without human intervention.
If I were starting this migration today, I would begin with the Lambda trigger inventory — not the pool itself. In every project I have been part of, triggers are the component with the most hidden dependencies: VPC configs, IAM roles with hardcoded ARNs, environment variables pointing to regional endpoints without abstraction. Cognito's native replication solves the identity data problem elegantly, but if triggers are not ready in the secondary region, the failover will appear to work at the DNS level and silently break in the actual authentication flow. The hard-won lesson: test the complete authentication flow — including Pre-Token-Generation — in the secondary region before considering the migration complete.
Anti-Patterns to Avoid in This Migration
- Enabling multi-Region replication without first auditing and fixing the custom attribute schema: existing attributes cannot be renamed, retyped, or removed, so legacy problems are permanently frozen into the replica.
- Using independent KMS keys per region instead of MRKs — this creates the need for data re-wrapping and adds unnecessary latency and complexity to the failover flow.
- Configuring DNS TTL above 60 seconds for the Cognito record — this increases client-perceived failover time beyond the health check detection time.
- Fetching the JWKS endpoint at request time in the Lambda Authorizer — this creates a synchronous cross-region dependency that can bring down token validation exactly when the primary region is degraded.
- Not adjusting InitiateAuth Service Quotas in the secondary region before failover: in a real failover event, the traffic spike will hit the default 120 RPS limit and cause massive throttling.
Verdict: Migrate, But Do It with Engineering Discipline
Amazon Cognito's native multi-Region replication is a step-change for systems that treat authentication as critical infrastructure, and every financial system should. The feature solves the hardest problem (credential and user data consistency across regions) in a managed way, eliminating the need for brittle and expensive homegrown solutions. My recommendation is clear: migrate to native replication if you haven't already.
But the migration requires discipline. The prerequisites are not optional: audited and correct schema, KMS MRKs configured with restrictive policies, mirrored Lambda triggers with correct regional dependencies, app clients replicated via IaC, Service Quotas adjusted in the secondary region, and, most importantly, a complete failover test with end-to-end authentication flow before declaring the migration complete. Without these elements, you will have a false sense of resilience: DNS goes to the secondary region, but login silently fails because the trigger can't find the correct DynamoDB table or the app client doesn't exist.
For Brazilian financial systems, add the governance layer: document the replication in the ROPA, ensure that security controls (KMS, CloudTrail, VPC) are equivalent in both regions, and implement mandatory claim validation in the Lambda Authorizer as a defense against silent trigger failures. Resilient identity is not a one-sprint project; it is a platform capability that needs to be built with the same seriousness as any other Tier-1 component.