Who is Fernando F. Azevedo?

Fernando F. Azevedo is a Senior Solutions Architect at Banco Itaú with 16+ years of experience across AWS, event-driven architecture, DevSecOps, Data Mesh, AI and financial systems.

What technical topics does Fernando work with?

Fernando works with AWS, Kubernetes, Kafka, Data Mesh, Amazon Bedrock, RAG, DevSecOps, observability, financial systems and architecture communication using C4, ADRs and trade-off analysis.

Is Fernando available for professional conversations?

Fernando is currently building at Banco Itaú and is open to thoughtful conversations about architecture, cloud, AI, engineering leadership, community, podcasts and technical collaboration.

Decision (ADR) · Identity platform (scenario) · Identity / Resilience · Accepted

ADR: Cognito Multi-Region for Resilient Authentication

Jun 5, 2026 · 9 min · AI-assisted

This ADR examines when and how to adopt multi-region User Pool replication in Amazon Cognito to reduce authentication downtime on identity platforms with high-availability requirements. It covers regional failover, customer-managed KMS keys, user synchronization, session and token impact, custom domains, and customer experience, with explicit reasoning on operational and cost trade-offs.

Authentication is the front gate of any system. When Amazon Cognito becomes unavailable in a region, users cannot log in — and no other part of the architecture matters. This ADR documents the decision to adopt multi-region User Pool replication for an identity platform with a 99.99% SLA, detailing the forces that made the solution necessary, the options evaluated, and the real consequences of the choice.

Scenario Context

System: Centralized identity platform (composite scenario)
Domain: Identity / Resilience
Scale: ~2 million active users, peak of 8,000 logins/min
Primary / secondary regions: us-east-1 (primary), us-west-2 (secondary)
Contractual authentication SLA: 99.99% (≤ 52 min downtime/year)
ADR trigger: 47-min incident in us-east-1 caused full login blockage; SLA nearly breached
Identity stack: Amazon Cognito User Pools, Lambda Triggers, KMS CMK, Route 53, ACM, API Gateway
Authentication model: OAuth 2.0 / OIDC, JWT tokens (ID + Access + Refresh), custom domain auth.example.com
Replication feature (GA): Amazon Cognito multi-Region replication — available from 2024
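
The SLA figure above can be checked with a few lines of arithmetic. This is a minimal sketch; the function name is illustrative, and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(sla_percent: float) -> float:
    """Minutes of downtime per year allowed by an availability SLA."""
    return MINUTES_PER_YEAR * (1 - sla_percent / 100)


budget = downtime_budget_minutes(99.99)
incident_minutes = 47  # duration of the us-east-1 incident in this scenario

print(f"Annual downtime budget: {budget:.2f} min")
print(f"Share consumed by the incident: {incident_minutes / budget:.0%}")
```

Under these assumptions the budget is about 52.56 minutes, so a single 47-minute incident uses roughly 89% of the yearly allowance.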

Context and Forces at Play

For years, Amazon Cognito operated as a regional service with no native mechanism for replicating User Pools across regions. The available resilience strategy was essentially manual: maintain a second User Pool in another region with users pre-provisioned via periodic export/import, manage failover through DNS, and accept that passwords could not be replicated (Cognito stores only per-user salted hashes, which are not exportable). This created a structural inconsistency window: any user who was created, or who changed their password, after the last sync would simply fail to authenticate in the secondary region.

The incident that triggered this ADR lasted 47 minutes in us-east-1. During that period, 100% of login flows failed. Token refresh flows that depended on online validation also failed for clients without valid cached tokens. The impact was amplified because the platform serves both B2C consumers and B2B operators — for the latter, login unavailability blocked critical business operations, not just convenience.

The forces shaping the decision were:
(1) Contractual SLA: 99.99%, which mathematically cannot tolerate regional incidents without automatic failover.
(2) Nature of identity data: passwords, MFA seeds, and user attributes are highly sensitive and require strict control over where and how they are replicated.
(3) Session complexity: JWT tokens issued by the primary region need to be validatable in the secondary without forced reissuance.
(4) Custom domains: auth.example.com is resolved by a CloudFront distribution managed by Cognito, and DNS failover must be coordinated without breaking clients' OAuth PKCE/redirect_uri flows.
(5) Cost and operational complexity: maintaining two manually synchronized User Pools is brittle and expensive in engineering effort.

The Problem with KMS Keys and Tokens in Multi-Region Scenarios

One aspect frequently underestimated in discussions about Cognito multi-region is the role of KMS keys. When you configure a User Pool with a customer-managed key (CMK) to encrypt sensitive attributes or for advanced token signing, that key is regional by default. If the secondary User Pool uses a different CMK — or if the primary key is not accessible in the secondary region — any data encrypted with the primary key becomes unreadable at failover.

The correct solution is to use KMS Multi-Region Keys (MRKs), available since 2021. An MRK allows the same logical key (same key ID) to exist in multiple AWS regions, with key material synchronized by AWS. This means data encrypted in us-east-1 with the MRK can be decrypted in us-west-2 with the replica of the same MRK — without key material export, maintaining the HSM security model. For this scenario, we configured MRKs in both regions and associated each User Pool with the corresponding regional replica.
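
A pre-failover sanity check can confirm that the key attached to each User Pool really is a replica of the same MRK. The sketch below works only on ARN strings and calls no AWS API. It relies on two properties of multi-Region keys: their key IDs start with `mrk-`, and related keys share the same key ID and account while living in different regions. The function names and the sample ARNs are hypothetical.

```python
from typing import NamedTuple


class KmsKeyArn(NamedTuple):
    region: str
    account: str
    key_id: str


def parse_kms_key_arn(arn: str) -> KmsKeyArn:
    """Split arn:<partition>:kms:<region>:<account>:key/<key-id> into parts."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "kms":
        raise ValueError(f"not a KMS ARN: {arn}")
    resource_type, _, key_id = parts[5].partition("/")
    if resource_type != "key" or not key_id:
        raise ValueError(f"not a KMS key ARN: {arn}")
    return KmsKeyArn(region=parts[3], account=parts[4], key_id=key_id)


def are_related_mrks(primary_arn: str, replica_arn: str) -> bool:
    """True when both ARNs name the same multi-Region key in different regions."""
    primary = parse_kms_key_arn(primary_arn)
    replica = parse_kms_key_arn(replica_arn)
    return (
        primary.key_id.startswith("mrk-")
        and primary.key_id == replica.key_id
        and primary.account == replica.account
        and primary.region != replica.region
    )


# Hypothetical ARNs for illustration only.
PRIMARY = "arn:aws:kms:us-east-1:111122223333:key/mrk-1234abcd"
REPLICA = "arn:aws:kms:us-west-2:111122223333:key/mrk-1234abcd"
print(are_related_mrks(PRIMARY, REPLICA))
```

A check like this fits naturally in the IaC pipeline as a plan-time assertion, so a User Pool accidentally pointed at a single-Region key is caught before a failover exposes it.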

Regarding JWT tokens: tokens issued by Cognito are signed with RSA keys managed internally by the service (exposed via the JWKS endpoint). With Cognito's native multi-region replication, the secondary User Pool shares the same issuer (iss claim) and the same signing keys as the primary — which is fundamental. Without this, a token issued by the primary User Pool would be rejected by any validator fetching the JWKS from the secondary, because the keys would differ. Native replication resolves this problem structurally. In manual architectures (two independent User Pools), this problem has no clean solution: either you force re-login at failover, or you implement a validation proxy that tries both JWKS endpoints — both options carry significant user experience cost or operational complexity.
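
To make the JWKS problem concrete, here is a sketch of the key-lookup step that a validation proxy for two independent User Pools would need. It reads the `kid` from the token's unverified header and searches each region's key set. It deliberately does not verify the signature, which would require a cryptography library; the function names and the shape of `jwks_by_region` are assumptions made for illustration. In practice each key set would come from the pool's `/.well-known/jwks.json` endpoint.

```python
import base64
import json
from typing import Optional, Tuple


def jwt_header(token: str) -> dict:
    """Decode the (unverified) JOSE header: the first dot-separated segment."""
    header_b64 = token.split(".")[0]
    # base64url segments in JWTs omit padding; restore it before decoding.
    padded = header_b64 + "=" * (-len(header_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def find_signing_key(token: str, jwks_by_region: dict) -> Optional[Tuple[str, dict]]:
    """Return (region, jwk) for the key whose kid matches the token, else None.

    jwks_by_region maps a region name to a parsed JWKS document,
    i.e. a dict of the form {"keys": [...]}.
    """
    kid = jwt_header(token).get("kid")
    if kid is None:
        return None
    for region, jwks in jwks_by_region.items():
        for key in jwks.get("keys", []):
            if key.get("kid") == kid:
                return region, key
    return None
```

With independent pools, a token issued in the primary only ever matches the primary's key set, which is exactly why the proxy has to try both. When issuer and signing keys are shared, the lookup succeeds against either region and the proxy becomes unnecessary.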

Options Evaluated

Option A: Accept regional risk (status quo)

Pros

Zero additional infrastructure cost
No synchronization operational complexity
Simple custom domain configuration

Cons

99.99% SLA mathematically unachievable without failover
Incident already demonstrated 47-min total downtime impact
Risk of contractual breach with B2B clients

Rejected. The real incident made this option indefensible.

Option B: Two independent User Pools with manual sync via Lambda/EventBridge

Pros

Available before the native replication feature
Full control over what is synchronized

Cons

Passwords cannot be replicated (per-user salted hashes)
Tokens issued by primary are invalid in secondary (different JWKS)
Structural inconsistency window for newly created users
High operational complexity: triggers, DLQs, reconciliation, dual monitoring
Failover forces re-login or complex validation proxy

Rejected. High complexity for incomplete resilience — worst of both worlds.

Option C: Cognito native multi-region replication + MRK + Route 53 ARC

Pros

Full replication of users, attributes, credentials, and JWT signing keys
Tokens issued by primary are valid in secondary (same issuer and JWKS)
MRKs ensure encryption continuity without key material export
DNS failover automatable via Route 53 ARC with health checks
Reduces operational complexity vs. manual synchronization

Cons

Additional cost: replication and dual User Pool execution (estimate: +30-40% on Cognito cost)
Custom domain requires independent configuration per User Pool; the ACM certificate backing each one must live in us-east-1
Lambda Triggers are not covered by replication and must be deployed and kept in sync across regions separately
Active sessions at failover time may require re-authentication if refresh token is not honored
Relatively new feature — edge case behavior still being documented by the community

Accepted. Best balance between real resilience, security, and manageable operational complexity.

Option D: Replace Cognito with another multi-region identity solution (self-hosted Keycloak, or SaaS such as Auth0/Okta)

Pros

Full control over replication, failover, and data model
Provider portability

Cons

Very high migration cost: 2M users, existing OAuth integrations, Lambda Triggers
Self-hosted Keycloak requires own infrastructure, patching, HA — real TCO much higher
Auth0/Okta: per-MAU cost prohibitive at 2M users
Does not solve the immediate problem; minimum 12-18 month timeline

Rejected for this ADR's horizon. Can be reassessed in a future strategic review.

Formal Decision

Accepted

Context

The identity platform operates with a 99.99% SLA and serves 2 million active users. A 47-minute regional incident in us-east-1 nearly breached the contractual SLA and blocked critical B2B client operations. The existing architecture had no authentication failover mechanism. AWS released native multi-region replication for Cognito User Pools, making a structural solution viable.

Decision

Adopt Amazon Cognito User Pool native multi-region replication between us-east-1 (primary) and us-west-2 (secondary), with KMS Multi-Region Keys for encryption continuity, Route 53 Application Recovery Controller (ARC) for automated DNS failover, and custom domains configured independently per User Pool, each backed by an ACM certificate in us-east-1 (a CloudFront requirement). Lambda Triggers will be managed via IaC (Terraform) with simultaneous deployment in both regions. Target RTO is 5 minutes; RPO is near-zero for user data (continuous replication).

Consequences

POSITIVE: JWT tokens issued in the primary region are valid in the secondary without forced re-authentication, as the issuer and signing keys are shared.
POSITIVE: User data (attributes, credentials, groups) is continuously replicated, eliminating the inconsistency window of the manual approach.
POSITIVE: MRKs ensure data encrypted with the primary key is decryptable in the secondary region without key material export.
NEGATIVE: Active sessions with in-flight refresh tokens at the exact moment of failover may require re-authentication — estimated impact window < 1% of active sessions.
NEGATIVE: Lambda Triggers must be kept in sync across regions via CI/CD pipeline; configuration drift is an operational risk requiring explicit monitoring.
NEGATIVE: Estimated additional cost of 30-40% on the Cognito line of the identity budget (estimate based on the public pricing model; validate with the AWS Pricing Calculator for actual scale).

Custom Domains, Sessions, and Customer Experience at Failover

The custom domain is one of the most delicate aspects of Cognito failover. When you configure auth.example.com as a User Pool's custom domain, Cognito provisions a service-managed CloudFront distribution and associates your ACM certificate with it. That certificate must be in us-east-1 regardless of where the User Pool is, because CloudFront only accepts ACM certificates from us-east-1. The secondary User Pool in us-west-2 therefore also needs a certificate in us-east-1 (the existing one can be reused if the domain is identical), plus its own custom domain configuration.

The DNS failover strategy we adopted uses Route 53 with two weighted CNAME records (weight 100/0 in normal operation) pointing to each region's CloudFront distributions. Route 53 ARC monitors health checks against the /oauth2/token endpoint of each region. When the primary health check fails for 3 consecutive 10-second periods, ARC automatically executes failover, changing the weight to 0/100. Record TTL is set to 60 seconds — a conscious trade-off between failover speed and DNS resolution load.
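
The numbers in this strategy can be sketched in code. The first function adds detection time to the record TTL to get a rough worst case for DNS-level failover. The second builds the weighted CNAME pair in the dictionary shape used by the Route 53 ChangeResourceRecordSets API (a `ChangeBatch`). The function names and CloudFront target hostnames are hypothetical, and the timing ignores resolvers that do not honor TTLs as well as client-side caching.

```python
def worst_case_failover_seconds(check_interval_s: int,
                                failure_threshold: int,
                                record_ttl_s: int) -> int:
    """Detection time plus the time cached answers may still point at the failed region."""
    detection = check_interval_s * failure_threshold
    return detection + record_ttl_s


def weighted_cname_changes(domain: str, targets: dict,
                           active_region: str, ttl_s: int = 60) -> dict:
    """Build a ChangeBatch giving weight 100 to the active region and 0 to the rest.

    targets maps a region name to the hostname its record should point at.
    """
    if active_region not in targets:
        raise ValueError(f"unknown region: {active_region}")
    changes = []
    for region, target in sorted(targets.items()):
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": domain,
                "Type": "CNAME",
                "SetIdentifier": region,  # distinguishes the weighted records
                "Weight": 100 if region == active_region else 0,
                "TTL": ttl_s,
                "ResourceRecords": [{"Value": target}],
            },
        })
    return {"Changes": changes}


# Hypothetical distribution hostnames, for illustration only.
TARGETS = {
    "us-east-1": "d111111abcdef8.cloudfront.net",
    "us-west-2": "d222222abcdef8.cloudfront.net",
}
print(worst_case_failover_seconds(10, 3, 60))
batch = weighted_cname_changes("auth.example.com", TARGETS, "us-west-2")
```

With 10-second checks, a threshold of 3, and a 60-second TTL, the estimate is 90 seconds, which leaves headroom under the 5-minute RTO target for anything this simple model leaves out.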

For customer experience, the most visible impact at failover is for users mid-authorization flow (OAuth redirect in progress). These flows are stateless from Cognito's perspective, but the state parameter and code_verifier (PKCE) are on the client — so the redirect to the callback URL works normally even after failover, as long as the redirect_uri is registered in the secondary User Pool. This requires the redirect_uris list to be kept identical in both User Pools — an operational checklist item that must be automated via IaC to prevent drift.
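
The parity requirement on redirect URIs is easy to check mechanically. A minimal sketch, assuming the callback URL lists of both app clients have already been fetched into plain Python lists; the function name and sample URLs are illustrative.

```python
def callback_url_drift(primary: list, secondary: list) -> dict:
    """Report callback URLs registered in only one of the two User Pools."""
    primary_set, secondary_set = set(primary), set(secondary)
    return {
        "missing_in_secondary": sorted(primary_set - secondary_set),
        "missing_in_primary": sorted(secondary_set - primary_set),
    }


# Hypothetical callback lists for illustration.
drift = callback_url_drift(
    ["https://app.example.com/callback", "https://admin.example.com/callback"],
    ["https://app.example.com/callback"],
)
print(drift)
```

Any non-empty entry is a flow that would break at failover, so a pipeline could fail the deployment whenever either list is non-empty.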

Multi-Region Failover Architecture — Authentication Flow

Diagram shows the normal flow (us-east-1 active) and the failover path to us-west-2, including User Pool replication, MRKs, and DNS control via Route 53 ARC.

👤 Client Layer

Browser / Mobile · OAuth 2.0 + PKCE
Application · API Gateway + Lambda

🌐 DNS & Routing

Route 53 · auth.example.com · Weighted CNAME
Route 53 ARC · Health Check · /oauth2/token

🔐 us-east-1 — Primary

CloudFront · auth.example.com · (Cognito-managed)
Cognito User Pool · us-east-1 (Primary) · OIDC Issuer
Lambda Triggers · Pre-Token / Post-Auth · us-east-1
KMS MRK · us-east-1 · mrk-xxxxxxxx

🔐 us-west-2 — Secondary (Standby)

CloudFront · auth.example.com · (Cognito-managed)
Cognito User Pool · us-west-2 (Replica) · Same Issuer + JWKS
Lambda Triggers · Pre-Token / Post-Auth · us-west-2
KMS MRK Replica · us-west-2 · mrk-xxxxxxxx (same)

📦 Shared / Control Plane

Cognito Replication · Continuous Sync · Users + Credentials
Terraform IaC · Dual-Region Deploy · Triggers + Config
CloudWatch · Alarms + Dashboards · Both Regions

Well-Architected Assessment

Security

KMS Multi-Region Keys maintain the HSM security model without key material export. Sensitive user attributes remain encrypted at rest in both regions. Least privilege is applied: Lambda Triggers have separate IAM roles per region, without unnecessary cross-region permissions. MFA and password policies are replicated along with the User Pool.

Reliability

Target RTO of 5 minutes via Route 53 ARC, with health checks every 10 seconds and failover triggered after 3 consecutive failures (about 30 seconds to detect). Near-zero RPO for user data with native continuous replication. Quarterly Game Days validate the failover procedure. The design eliminates Cognito as a regional SPOF for the authentication flow.
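
One way to read the RTO target against the SLA is to ask how many full regional incidents per year the downtime budget can absorb. This is a rough sketch that treats each incident as costing exactly its recovery time; the function name is illustrative.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # ignoring leap years


def incidents_within_budget(sla_percent: float, minutes_per_incident: float) -> int:
    """Whole incidents of a given length that fit in the yearly downtime budget."""
    budget = MINUTES_PER_YEAR * (1 - sla_percent / 100)
    return math.floor(budget / minutes_per_incident)


print(incidents_within_budget(99.99, 5))   # at the 5-minute RTO target
print(incidents_within_budget(99.99, 47))  # at the observed 47-minute outage
```

Under these assumptions, a 99.99% SLA tolerates about ten incidents a year at a 5-minute RTO, against a single incident of the length observed before failover existed.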

Implementation Plan

Phase 1: Foundation (Weeks 1-2)
Create KMS Multi-Region Key in us-east-1 and replicate to us-west-2. Update primary User Pool to use the MRK. Validate that existing user data remains accessible. Configure Terraform modules for dual-region management.

Phase 2: Replication (Weeks 3-4)
Enable Cognito native multi-region replication. Wait for initial synchronization (time proportional to user volume). Validate integrity of user sample in secondary region. Replicate Lambda Triggers via CI/CD pipeline with simultaneous deployment.

Phase 3: Domain and DNS (Week 5)
Provision ACM certificate in us-east-1 for auth.example.com (or reuse existing if domain is identical). Configure custom domain on secondary User Pool. Create weighted Route 53 records (100/0). Configure Route 53 ARC with health checks against /oauth2/token of both regions.

Phase 4: Validation and Game Day (Week 6)
Execute controlled Game Day: simulate us-east-1 failure, measure actual RTO, validate that existing tokens are accepted in secondary, validate complete OAuth flows (authorization code + PKCE), validate MFA. Document results and adjust TTLs and health check thresholds as needed.
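
Measuring actual RTO during the Game Day can be as simple as probing the login endpoint on a fixed cadence and post-processing the results. The sketch below assumes a synthetic probe has already produced a time-ordered list of (timestamp in seconds, success) pairs; the function name and the sample data are illustrative.

```python
from typing import Optional, Sequence, Tuple


def observed_rto_seconds(probes: Sequence[Tuple[float, bool]]) -> Optional[float]:
    """Time from the first failed probe to the first later successful probe.

    Returns None when no failure was observed or service never recovered.
    """
    outage_start = None
    for timestamp, ok in probes:
        if outage_start is None:
            if not ok:
                outage_start = timestamp  # first failed probe marks the outage start
        elif ok:
            return timestamp - outage_start
    return None


# Hypothetical probe log: failure injected around t=10s, recovery seen at t=100s.
log = [(0, True), (10, False), (20, False), (30, False), (100, True), (110, True)]
print(observed_rto_seconds(log))
```

The measured value is only as precise as the probe interval, so the probe cadence should be well below the thresholds being tuned.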

My Perspective — What I'd Do and What I Learned

Senior Solutions Architect

Cognito's native multi-region replication is a genuinely important addition to AWS's resilience portfolio, but it doesn't solve everything automatically, and it's easy to underestimate the operational friction points.

The point that concerns me most in real implementations is Lambda Trigger drift. Cognito replication synchronizes user data, not application logic. If you have Pre-Token Generation triggers that add custom claims, or Post-Authentication triggers that log events to a SIEM, those need to exist and function identically in both regions. In organizations with less mature deployment pipelines, I've seen cases where the secondary region ran for weeks with outdated triggers, meaning that in a real failover, issued tokens would have carried different claims than downstream services expected. This is a silent failure that only appears in production, at the worst possible moment. Automate trigger deployment as part of the same pipeline that deploys the User Pool. No exceptions.

On KMS MRKs: the decision to use MRKs is correct and necessary, but there's a security detail that needs to be explicit in the key policy. The MRK replica in us-west-2 should have a key policy that restricts use to the specific secondary User Pool, not a permissive policy allowing any principal in the account. In financial systems, auditors will question this, and the answer needs to be documented.

Finally, on the cost trade-off: the argument that a 30-40% increase in Cognito cost is justified by the SLA is correct but incomplete. The real cost also includes engineering time to maintain configuration parity between regions, quarterly Game Days, and dual monitoring. This recurring operational cost is frequently underestimated in ROI analyses. For systems with fewer than ~500k MAUs, where Cognito cost is small in absolute terms, the operational overhead may dominate. For 2M MAUs with a contractual SLA, the math clearly works out, but be honest about the total cost when presenting the decision to stakeholders.
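
Trigger drift can be detected by comparing a fingerprint of each deployed function across regions. The sketch below assumes a code hash per trigger has already been collected for each region (Lambda reports one as `CodeSha256` in a function's configuration) and only does the comparison. The function name and sample values are illustrative.

```python
def trigger_drift(primary: dict, secondary: dict) -> list:
    """Names of triggers that are missing in one region or whose code hash differs.

    Each argument maps a trigger name to the code hash deployed in that region.
    """
    drifted = []
    for name in sorted(set(primary) | set(secondary)):
        if primary.get(name) is None or primary.get(name) != secondary.get(name):
            drifted.append(name)
    return drifted


# Hypothetical hashes for illustration.
east = {"pre-token-generation": "abc123", "post-authentication": "def456"}
west = {"pre-token-generation": "abc123", "post-authentication": "000000"}
print(trigger_drift(east, west))
```

Running a comparison like this on a schedule, and alarming on any non-empty result, turns the silent failure described above into a visible one well before a failover.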

Comparison: Before vs. After Multi-Region Architecture

Dimension	Before (Single Region)	After (Native Multi-Region)
Authentication RTO	~47 min (observed in incident)	~5 min (target, via Route 53 ARC)
User data RPO	Hours (last manual export)	Near-zero (continuous replication)
Token validity at failover	Invalid (different JWKS)	Valid (same issuer and JWKS)
Credential consistency	Partial (passwords not replicable)	Complete (native replication includes credentials)
Operational complexity	Low (single region)	Medium (dual region, IaC mandatory)
Relative identity cost	Baseline	+30-40% estimated (Cognito + MRK + ARC)

Verdict

Amazon Cognito's native multi-region replication solves a problem that, until recently, had no clean solution on the platform. For systems with a 99.99% or higher authentication SLA, with user volumes that make migration to alternatives prohibitive, and with compliance requirements demanding control over identity data replication, this is the correct architecture, and this ADR documents why.

The three points that define the success or failure of this architecture are:
(1) MRKs correctly configured with restrictive per-region key policies. Without this, failover may fail to decrypt data or, worse, fail silently.
(2) Lambda Triggers in parity, guaranteed by the CI/CD pipeline. Token logic drift is a failure that only surfaces in production under pressure.
(3) Quarterly Game Days with real failover. A 5-minute RTO is a target, not a guarantee, until validated under real conditions.

What this ADR does not resolve: the complexity of custom domains with multiple environments (staging, prod, sandbox) sharing the same Cognito. In those cases, the proliferation of replicated User Pools can become a governance problem. It also does not resolve the long-term strategic question of dependency on a single managed identity provider. Those are decisions for future ADRs, but they should be on the radar of any architect responsible for identity platforms at scale.

References

Amazon Cognito multi-Region replication (AWS News Blog)
Amazon Cognito (Product Page)
AWS Well-Architected Framework: Security Pillar
AWS KMS Multi-Region Keys (Developer Guide)
Route 53 Application Recovery Controller (Developer Guide)
Amazon Cognito User Pools: Custom Domains

Case sources

AWS News Blog: Cognito multi-Region replication
Amazon Cognito
AWS Well-Architected: Security Pillar

Written with AI assistance from the public case and my architect's reading.
