# Design Doc: Multi-Region Active-Active Payments API

This document proposes a multi-region active-active architecture for a critical payments API, targeting near-zero RTO/RPO, deterministic conflict resolution in data replication, and a phased rollout that minimizes operational risk. The design is grounded in real financial engineering principles and AWS patterns, with explicit trade-offs between consistency, latency, and cost.

- URL: https://fernando.moretes.com/studies/design-doc-multiregion-active-active-payments

- Markdown: https://fernando.moretes.com/studies/design-doc-multiregion-active-active-payments/study.md?lang=en

- Type: Design Doc / RFC

- Company: Payments API (scenario)

- Domain: Resilience

- Date: 2026-02-01

- Tags: multi-region, active-active, payments, resilience, rto-rpo, conflict-resolution, aws, data-replication

- Reading time: 10 min

---

A payments API that goes down for 4 minutes can cost millions and destroy regulatory trust. This RFC defines how to build a system that survives the total loss of an AWS region without perceptible interruption — and why most naive approaches fail precisely when it matters most.

## The Problem: Payments Resilience Is Not Just High Availability

Most systems that self-declare as 'highly available' are, in practice, active-passive with manual or semi-automatic failover. For a payments API, this is insufficient for reasons that go beyond technical SLAs.

First, the regulatory context. Central banks and card schemes (Visa, Mastercard) require RTO measurable in seconds, not minutes. In Brazil, the Central Bank monitors PIX availability with minute-level granularity and fines participants that exceed downtime thresholds. This transforms resilience from an engineering decision into a compliance obligation.

Second, the nature of the data. Payments are not simple idempotent operations. A transfer involves debiting one account, crediting another, recording in a ledger, sending notifications, and frequently integrating with external systems (clearing houses, correspondent banks). If a region fails mid-saga, state can become inconsistent in ways that are hard to detect and dangerous to correct automatically.

Third, the split-brain problem. In a genuine active-active system, two regions can receive concurrent requests for the same resource — for example, two simultaneous withdrawals from the same account processed in different regions before replication propagates the updated balance. Without a robust conflict resolution mechanism, the system may authorize transactions that should be denied, resulting in fraud or overdraft.

This document does not propose 'high availability'. It proposes a system that maintains transactional correctness even under region failure, with minimal added latency on the happy path and predictable degraded behavior on the failure path.

## System Fact Sheet

- **System:** Payments API (composite scenario)
- **Estimated volume:** 5,000–15,000 TPS at peak
- **Target AWS regions:** us-east-1 (primary), sa-east-1 (secondary), us-west-2 (tertiary)
- **Target RTO:** < 30 seconds for automatic routing failover (Aurora writer promotion takes 60–90 seconds; see the persistence layer below)
- **Target RPO:** < 1 second (asynchronous Aurora Global Database replication with typical lag under 1 second)
- **Primary stack:** Amazon Aurora Global Database, DynamoDB Global Tables, Amazon Route 53 ARC, API Gateway, EKS, SQS, EventBridge
- **Regulatory domain:** Instant payments (PIX), cards, international transfers
- **Target availability:** 99.995% (~26 minutes of downtime/year)
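
The downtime figure in the availability target follows from simple arithmetic. A minimal sketch, assuming a 365-day year and a 30-day month (the function name is illustrative, not part of the design):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_pct: float,
                             period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Downtime budget implied by an availability target over a period."""
    return period_minutes * (1 - availability_pct / 100)


yearly = allowed_downtime_minutes(99.995)
monthly = allowed_downtime_minutes(99.995, 30 * 24 * 60)
print(f"{yearly:.1f} min/year, {monthly:.2f} min per 30-day month")
```

Measured monthly, the same target leaves a budget of only a couple of minutes, which is why the failover path has to be automatic.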

## Goals and Non-Goals

- ✅ GOAL: Survive total loss of any AWS region without manual intervention and without loss of confirmed transactions
- ✅ GOAL: Guarantee each financial transaction is processed exactly once (exactly-once semantics) even under inter-region network failures
- ✅ GOAL: P99 latency < 200ms for payment authorization on the happy path (no active failover)
- ✅ GOAL: Deterministic and auditable conflict resolution for concurrent writes across regions
- ✅ GOAL: Phased rollout with rollback capability at each phase without customer impact
- ❌ NON-GOAL: Replication to regions outside AWS (on-premises or other clouds)

## Proposed Design: Architecture and Fundamental Decisions

The design starts from a premise I consider non-negotiable for financial systems: **correctness precedes availability**. This means I prefer to reject a transaction with a clear error rather than process it incorrectly. With that established, the design is organized into three layers.

**Layer 1: Routing and Failover (Route 53 ARC + Global Accelerator)**

Amazon Route 53 Application Recovery Controller (ARC) is the central control plane for failover. It maintains continuous readiness checks for each recovery cell (one per region) and allows programmatic failover with a single API call. Global Accelerator routes traffic to the nearest region based on network latency, with endpoint-level health checks every 10 seconds.

The critical decision here is **not to use pure DNS-based failover**. DNS TTLs, even with low values (60s), combined with intermediate resolver caches, make failover timing unpredictable. Global Accelerator operates at the network layer (Anycast), which eliminates this problem and enables failover in seconds.

**Layer 2: Transaction Processing (EKS + Saga Pattern)**

Each region runs an EKS cluster with the payment microservices. The Choreographed Saga pattern is used for transactions involving multiple services, with events published to Amazon EventBridge. Each saga step is idempotent and records its state in DynamoDB Global Tables with conditional writes.

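Idempotency is implemented via an **idempotency key** in each request header (UUID v4 generated by the client). The authorization service records the key in DynamoDB with a conditional write before processing; if the key already exists, it returns the stored result instead of processing again. This gives exactly-once processing even when the client retries after a timeout, provided the retry reaches a replica that has already seen the key. Because Global Tables replicate asynchronously, a retry that lands in another region before the key has replicated is not covered by this check alone.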
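
The idempotency check can be sketched as a put-if-absent followed by the side effect. This is a minimal illustration, not the service's actual code: `IdempotencyStore` is a hypothetical in-memory stand-in for a single table replica, and `authorize` and `charge` are illustrative names.

```python
import uuid
from typing import Callable, Dict


class IdempotencyStore:
    """Hypothetical in-memory stand-in for one table replica."""

    def __init__(self) -> None:
        self._items: Dict[str, dict] = {}

    def put_if_absent(self, key: str, value: dict) -> bool:
        # Emulates a conditional write that succeeds only if the key is new.
        if key in self._items:
            return False
        self._items[key] = value
        return True

    def get(self, key: str) -> dict:
        return self._items[key]

    def update(self, key: str, value: dict) -> None:
        self._items[key] = value


def authorize(store: IdempotencyStore, key: str, charge: Callable[[], dict]) -> dict:
    """Run `charge` at most once per idempotency key on this replica."""
    if store.put_if_absent(key, {"status": "IN_PROGRESS"}):
        result = charge()
        store.update(key, {"status": "DONE", "result": result})
        return result
    record = store.get(key)
    if record["status"] == "DONE":
        return record["result"]  # replay the stored response
    return {"status": "IN_PROGRESS"}  # another attempt is still running


calls = []

def charge() -> dict:
    calls.append(1)
    return {"status": "AUTHORIZED", "amount": 800}

store = IdempotencyStore()
key = str(uuid.uuid4())
first = authorize(store, key, charge)
retry = authorize(store, key, charge)
print(first == retry, len(calls))
```

Claiming the key before running the side effect is what closes the gap between "check" and "process". The sketch does not handle a crash between the claim and the update; a real implementation would need an expiry or recovery path for keys stuck in `IN_PROGRESS`.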

**Layer 3: Persistence and Replication (Aurora Global + DynamoDB Global Tables)**

The financial ledger (balances, confirmed transactions) lives in Amazon Aurora Global Database with a writer instance in us-east-1 and readers in sa-east-1 and us-west-2. Aurora Global's typical replication lag is < 1 second, meeting the defined RPO.

Short-term operational state (idempotency keys, distributed locks, sessions) lives in DynamoDB Global Tables, which replicates asynchronously between multi-active replicas. Here is the central trade-off: DynamoDB Global Tables uses **last-writer-wins (LWW) based on timestamp** as the default conflict resolution policy. For idempotency keys this is acceptable: each key is written once and then only read until its long TTL expires, so LWW rarely has competing writes to resolve. For balances, **it is not acceptable**, which is why balances live in Aurora, not DynamoDB.

For the Aurora failover case (promoting reader to writer), Route 53 ARC coordinates the sequence: (1) stop writes in the primary region, (2) wait for zero-lag replication confirmation, (3) promote reader, (4) update endpoints. This process typically takes 60–90 seconds, which only meets the 30-second RTO if step 2 is eliminated — which requires accepting potential sub-second data loss. My recommendation is to keep the real RTO at 90 seconds for database failover and be transparent about this with stakeholders.
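
The trade-off around step 2 can be made concrete with a toy model. The sketch below is an assumption-laden illustration: `Region`, `replicate`, and `fail_over` are hypothetical names, the transaction log is a plain list, and no AWS API is involved.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Region:
    name: str
    accepts_writes: bool = False
    log: List[str] = field(default_factory=list)  # committed transaction ids


def replicate(primary: Region, replica: Region) -> None:
    """Copy every transaction the replica has not seen yet."""
    replica.log.extend(primary.log[len(replica.log):])


def fail_over(primary: Region, replica: Region, wait_for_zero_lag: bool) -> List[str]:
    """Run the promotion sequence and return the transactions that were lost."""
    primary.accepts_writes = False          # (1) stop writes in the primary
    if wait_for_zero_lag:
        replicate(primary, replica)         # (2) wait until the backlog is drained
    lost = primary.log[len(replica.log):]
    replica.accepts_writes = True           # (3) promote the reader
    # (4) the endpoint update is outside this model
    return lost


def scenario(wait: bool) -> List[str]:
    primary = Region("us-east-1", accepts_writes=True, log=["t1", "t2", "t3"])
    lagging = Region("sa-east-1", log=["t1", "t2"])  # one transaction behind
    return fail_over(primary, lagging, wait_for_zero_lag=wait)


lost_fast = scenario(wait=False)
lost_safe = scenario(wait=True)
print(lost_fast, lost_safe)
```

Skipping the wait shortens the sequence but turns the replication backlog into lost transactions. The model also assumes the primary is still reachable; in a hard region loss, step 2 cannot complete at all.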

## Multi-Region Active-Active Architecture

Flow of a payment transaction in normal operation and behavior during region failover. Each region is an independent recovery cell capable of processing complete transactions.

### 🌐 Global Layer

- Client: Mobile/Web (user)
- Global Accelerator: Anycast routing (network)
- Route 53 ARC: Readiness + Failover (network)
- AWS WAF: DDoS + Rate limit (security)

### 🇺🇸 us-east-1 (Primary Writer)

- API Gateway: REST + mTLS (frontend)
- EKS Cluster: Payment Services (compute)
- Authorization Service (compute)
- Saga Orchestrator: EventBridge Pipes (messaging)
- Aurora Global: Writer (Primary) (data)
- DynamoDB: Global Tables (data)
- SQS FIFO: Dead Letter Queue (messaging)

### 🇧🇷 sa-east-1 (Secondary)

- API Gateway: REST + mTLS (frontend)
- EKS Cluster: Payment Services (compute)
- Authorization Service (compute)
- Aurora Global: Reader, promoted to Writer on failover (data)
- DynamoDB: Global Tables (data)

### 🔁 Replication & Control

- Aurora Replication: < 1s lag (data)
- DynamoDB Replication: Async multi-master (data)
- CloudWatch: Alarms + Dashboards (security)

### Flows

- client -> ga: HTTPS
- ga -> waf: Anycast
- waf -> apigw1: Normal
- waf -> apigw2: Failover
- arc -> ga: Controls routing
- apigw1 -> eks1
- eks1 -> auth1
- auth1 -> dynamo1: Idempotency check
- auth1 -> aurora1: Write (ledger)
- eks1 -> saga1
- saga1 -> sqs1: DLQ on failure
- aurora1 -> rep_aurora
- rep_aurora -> aurora2
- dynamo1 -> rep_dynamo
- rep_dynamo -> dynamo2
- apigw2 -> eks2
- eks2 -> auth2
- auth2 -> dynamo2
- auth2 -> aurora2
- cloudwatch -> arc: Trigger failover

## Conflict Resolution: The Hardest Problem

Conflicts in active-active payment systems are not theoretical — they happen every time there is replication latency and concurrent traffic. The question is not whether they will occur, but how the system behaves when they do.

**Typical conflict scenario:** User has a balance of $1,000. Makes two withdrawals of $800 almost simultaneously — one processed in us-east-1, another in sa-east-1, before replication propagates the updated balance. Without protection, both are authorized, resulting in a negative balance of -$600.
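
The scenario can be reproduced in a few lines. A minimal sketch, assuming each region authorizes against the balance it sees locally and that neither has received the other's update yet (names are illustrative):

```python
def authorize_withdrawal(local_balance: int, amount: int) -> bool:
    """Approve when the balance visible in this region covers the amount."""
    return local_balance >= amount


initial_balance = 1_000
withdrawals = {"us-east-1": 800, "sa-east-1": 800}

# Both regions still see the pre-withdrawal balance: replication has not caught up.
approved = {
    region: authorize_withdrawal(initial_balance, amount)
    for region, amount in withdrawals.items()
}

# Once replication converges, both approved debits are applied.
final_balance = initial_balance - sum(
    amount for region, amount in withdrawals.items() if approved[region]
)
print(approved, final_balance)
```

Each decision is locally correct; the overdraft only appears when the two views are merged, which is why the check has to be serialized somewhere.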

**Our approach: Balance Reservation with Distributed Lease**

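For transactions involving balance (withdrawals, debit transfers), we implement a **distributed lease** mechanism using DynamoDB with conditional writes. Before processing, the authorization service attempts to acquire an exclusive lease for the account with a 5-second TTL. The conditional write guarantees that only one writer acquires the lease on a given replica. With asynchronous replication the condition is evaluated against the local replica, so two regions can still both acquire the lease inside the replication window; this is the residual conflict window discussed at the end of this section.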

```
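ConditionExpression: attribute_not_exists(lease_owner) OR lease_expires < :now
UpdateExpression: SET lease_owner = :region, lease_expires = :expires, version = if_not_exists(version, :zero) + :one
ExpressionAttributeValues: :now = current epoch seconds, :expires = :now + 5, :region = acquiring region, :zero = 0, :one = 1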
```

If the lease cannot be acquired (another region is processing), the request returns HTTP 409 with `Retry-After: 1`. The client retries, and on the second attempt the previous lease has usually been released. If it is still held, the client receives another 409 and retries again, for at most the 5-second TTL.
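
The acquire-or-409 flow can be sketched against an in-memory table. This is a hedged illustration: `LeaseTable` is a hypothetical stand-in for a single replica, so it shows the conditional-write logic but not cross-region replication, and `handle_debit` is an illustrative handler name.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

LEASE_TTL_SECONDS = 5


@dataclass
class Lease:
    owner: str
    expires_at: float
    version: int


class LeaseTable:
    """Hypothetical in-memory stand-in for one replica with conditional updates."""

    def __init__(self) -> None:
        self._leases: Dict[str, Lease] = {}

    def try_acquire(self, account_id: str, region: str, now: float) -> bool:
        current = self._leases.get(account_id)
        # Mirrors: attribute_not_exists(lease_owner) OR lease_expires < :now
        if current is not None and current.expires_at >= now:
            return False
        version = (current.version if current else 0) + 1
        self._leases[account_id] = Lease(region, now + LEASE_TTL_SECONDS, version)
        return True

    def release(self, account_id: str, region: str) -> None:
        current = self._leases.get(account_id)
        if current is not None and current.owner == region:
            current.expires_at = 0.0  # expire immediately, keep the version counter


def handle_debit(table: LeaseTable, account_id: str, region: str,
                 now: Optional[float] = None) -> Tuple[int, Dict[str, str]]:
    """Return (HTTP status, headers) for a debit attempt."""
    now = time.time() if now is None else now
    if not table.try_acquire(account_id, region, now):
        return 409, {"Retry-After": "1"}
    try:
        # The ledger write would happen here while the lease is held.
        return 200, {}
    finally:
        table.release(account_id, region)


table = LeaseTable()
table.try_acquire("acct-1", "us-east-1", now=100.0)  # us-east-1 is mid-transaction
blocked = handle_debit(table, "acct-1", "sa-east-1", now=101.0)
after_expiry = handle_debit(table, "acct-1", "sa-east-1", now=106.0)
print(blocked, after_expiry)
```

Releasing in `finally` keeps the common case short, while the TTL bounds how long a crashed holder can block the account.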

**Why not use Two-Phase Commit (2PC)?**

2PC guarantees distributed atomicity, but introduces a transaction coordinator that becomes a single point of failure and adds latency proportional to the number of participants. For a payments API with a 200ms P99 SLA, cross-region 2PC is not viable — the round-trip between us-east-1 and sa-east-1 alone is ~120ms.
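
A rough latency floor shows why. The sketch assumes the coordinator needs one cross-region round trip per phase (prepare, then commit) before it can answer, and counts network time only; both are simplifying assumptions.

```python
CROSS_REGION_RTT_MS = 120  # us-east-1 <-> sa-east-1, the figure quoted above
P99_BUDGET_MS = 200


def two_phase_commit_floor_ms(rtt_ms: float, phases: int = 2) -> float:
    """Lower bound on commit latency: one round trip per phase."""
    return rtt_ms * phases


floor_ms = two_phase_commit_floor_ms(CROSS_REGION_RTT_MS)
exceeds_budget = floor_ms > P99_BUDGET_MS
print(floor_ms, exceeds_budget)
```

Under these assumptions the network alone consumes more than the whole P99 budget, before any processing or database time is counted.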

**Why not use Paxos/Raft?**

Consensus protocols like Raft guarantee strong consistency, but require majority quorum for each write. With three regions, this means each transaction needs confirmation from at least 2 regions — again, unacceptable latency for the happy path.

**The pragmatic solution:** Separate the data domain by access pattern. Data requiring strong consistency (balances, ledger) lives in Aurora with a single writer at a time. Data tolerating eventual consistency (idempotency keys, audit events) lives in DynamoDB. The distributed lease via DynamoDB is the coordination mechanism — it is eventual, but the short TTL (5s) limits the conflict window to an interval acceptable for the business.

## Evaluated Architecture Alternatives

### Active-Passive with Aurora Multi-AZ

**Pros**
- No write conflicts — single writer always
- Simple and well-understood operational model
- Significantly lower cost

**Cons**
- RTO of 2–5 minutes for region failover (unacceptable for PIX)
- Secondary region serves no traffic — wasted capacity
- Manual or semi-automatic failover with human error risk

**Verdict:** Rejected: RTO incompatible with regulatory requirements

### Active-Active with CockroachDB (multi-region)

**Pros**
- Native global serializable consistency
- No need for application-level conflict resolution logic
- Familiar SQL with distributed ACID semantics

**Cons**
- Write latency proportional to quorum: ~120–200ms cross-region
- Not an AWS managed service — increases operational overhead
- Significantly higher licensing and operational cost

**Verdict:** Rejected: Write latency violates 200ms P99 SLA

### Active-Active with Aurora Global + DynamoDB (proposed)

**Pros**
- Aurora replication < 1s meets financial RPO
- DynamoDB Global Tables for operational state with millisecond latency
- Route 53 ARC + Global Accelerator for failover in seconds
- Managed services reduce operational overhead

**Cons**
- Complexity of distributed lease logic in the application
- Aurora still has single writer — promotion takes 60–90s
- Cross-region replication cost (egress + storage)

**Verdict:** Accepted: Best balance between consistency, latency, and operability

### Active-Active with Event Sourcing + Global CQRS

**Pros**
- Complete and immutable audit trail by design
- Conflict resolution via event reordering (causal ordering)
- Excellent read scalability with regional projections

**Cons**
- Very high implementation complexity — long learning curve
- Eventual consistency for balance reads is problematic for UX
- Requires complete rewrite of the domain model

**Verdict:** Considered for future version — excessive complexity for initial phase

## Decision: Data Replication Strategy

**Status:** accepted

**Context**

We need RPO < 1s for financial data, but strong global consistency implies unacceptable write latency. The decision is about how to segment data between persistence mechanisms.

**Decision**

Ledger data (balances, confirmed transactions) → Aurora Global Database with single writer. Operational data (idempotency keys, leases, sessions) → DynamoDB Global Tables with LWW. Audit events → Kinesis Data Streams, archived to S3 and copied to the other regions with S3 Cross-Region Replication.

**Consequences**
- Aurora writer is a single point of failure for writes — failover takes 60–90s, not 30s
- DynamoDB LWW is acceptable for idempotency keys because the first write is what matters and TTL is long
- Estimated additional cost of 35–45% vs. single-region architecture (estimate based on AWS public pricing)

## Phased Rollout Plan

1. **Phase 0 — Foundation (Weeks 1–4)** — Provision multi-region infrastructure via IaC (Terraform). Configure Aurora Global Database with replication to sa-east-1. Configure DynamoDB Global Tables. Implement Route 53 ARC with readiness checks. No production traffic in new regions. Validate RPO via fault injection tests (AWS FIS) in staging environment.

2. **Phase 1 — Multi-Region Reads (Weeks 5–8)** — Route 100% of reads (balance queries, history) to the region closest to the client using regional Aurora readers. Writes still go 100% to us-east-1. Monitor replication lag and read consistency. KPI: P99 lag < 500ms. Rollback: remove regional read routing, return to us-east-1.

3. **Phase 2 — Multi-Region Write Canary (Weeks 9–14)** — Enable writes in sa-east-1 for 5% of Brazilian client traffic (selection by feature flag in API Gateway). Activate distributed lease mechanism. Monitor conflicts, lease acquisition latency, and 409 rate. Gradually increase to 25%, 50%, 100% with quality gates at each step. Granular rollback by feature flag without customer impact.

4. **Phase 3 — Automatic Failover (Weeks 15–18)** — Activate automatic failover via Route 53 ARC based on CloudWatch alarms (error rate > 1% for 60s, P99 latency > 500ms for 120s). Conduct game day: simulate total failure of us-east-1 during low-traffic hours. Measure actual RTO. Document failover and recovery runbook. Train on-call team.

5. **Phase 4 — Third Region and Steady State (Weeks 19–24)** — Add us-west-2 as third recovery cell. Implement regional circuit breaker in API Gateway. Conduct monthly chaos engineering with AWS FIS. Review and adjust failover thresholds based on real data. Publish external SLA of 99.995% after 30 days of stable operation.
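
For the Phase 2 canary, one common way to select a stable percentage of clients is deterministic hash bucketing. The sketch below is one possible approach, not the flagging mechanism the plan prescribes; `in_canary`, `write_region`, the salt, and the `country` field are all hypothetical.

```python
import hashlib


def in_canary(client_id: str, percent: int, salt: str = "phase2-write-canary") -> bool:
    """Place a client in one of 100 stable buckets; the canary is buckets below `percent`."""
    digest = hashlib.sha256(f"{salt}:{client_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


def write_region(client_id: str, country: str, percent: int) -> str:
    """Route writes for Brazilian canary clients to sa-east-1, everyone else to us-east-1."""
    if country == "BR" and in_canary(client_id, percent):
        return "sa-east-1"
    return "us-east-1"


clients = [f"client-{i}" for i in range(1000)]
at_5 = {c for c in clients if in_canary(c, 5)}
at_25 = {c for c in clients if in_canary(c, 25)}
print(len(at_5), len(at_25), at_5 <= at_25)
```

Because a client's bucket never changes, raising the percentage only adds clients to the canary, and setting it to 0 is an immediate rollback.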

> **Critical Risks and Mitigations:**
>
> **Risk 1: Clock Skew Between Regions**
> DynamoDB LWW uses server timestamps. If there is clock skew between regions (possible even with NTP), the 'last write' may not be the chronologically most recent. Mitigation: use logical versioning (monotonic version number) instead of timestamps for critical leases. Monitor clock drift via CloudWatch.
>
> **Risk 2: Retry Cascade During Failover**
> When a region fails, all clients retry simultaneously to the surviving region, potentially causing a thundering herd. Mitigation: implement exponential backoff with jitter on the client, circuit breaker in API Gateway with per-account rate limiting, and pre-warmed auto-scaling in the secondary region.
>
> **Risk 3: Orphan Lease from Instance Crash**
> If the instance that acquired the lease crashes before releasing it, the lease remains held until TTL expires (5s). During these 5s, the account is blocked for new transactions. Mitigation: 5s TTL is acceptable for the business; implement lease heartbeat for long transactions; monitor lease expiration rate.
>
> **Risk 4: Schema Divergence During Deploy**
> Rolling deploys across multiple regions may have different database schema versions running simultaneously. Mitigation: mandatory backward-compatible migrations (never remove columns, only add); feature flags for new fields; cross-version compatibility tests in the CI pipeline.
>
> **Risk 5: Unexpected Egress Cost**
> Cross-region replication generates egress costs that can be significant at high volume. Estimate: at 10,000 TPS with average payload of 2KB, Aurora replication egress is ~1.7TB/day.
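
The egress estimate in Risk 5 can be reproduced as follows. The sketch assumes 2KB means 2,000 bytes, a sustained rate for the full day, and that each transaction's payload crosses the region boundary once per replica; actual Aurora replication volume depends on the redo log rather than the request payload.

```python
SECONDS_PER_DAY = 86_400


def daily_replication_bytes(tps: int, payload_bytes: int, replicas: int = 1) -> int:
    """Bytes sent across regions per day if every payload is copied once per replica."""
    return tps * payload_bytes * SECONDS_PER_DAY * replicas


daily_tb = daily_replication_bytes(10_000, 2_000) / 1e12
print(f"{daily_tb:.3f} TB/day per replica region")
```

With a second replica region the figure doubles, so it should be re-run when us-west-2 is added in Phase 4.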

## AWS Well-Architected Assessment

- **security**: Mandatory mTLS between services; KMS with CMK per region for data at rest; IAM roles with least privilege per microservice; VPC with private subnets and PrivateLink for AWS services; WAF with rate limiting rules per payment account
- **reliability**: Route 53 ARC for automatic failover; Aurora Global with RPO < 1s; DynamoDB Global Tables multi-master; regional circuit breakers; monthly chaos engineering with AWS FIS
- **performance**: Global Accelerator for low-latency routing; regional reads via Aurora readers; DynamoDB for operational state with millisecond latency; auto-scaling based on business metrics (TPS) not just CPU
- **sustainability**: Workload consolidation on Graviton3 (ARM) instances for better energy efficiency; aggressive auto-scaling to reduce idle capacity; regions with renewable energy mix prioritized when possible

## Success Metrics and Targets

- **Availability:** ≥ 99.995% measured monthly
- **RTO (region failover):** < 90 seconds (routing: < 30s; Aurora promotion: < 90s)
- **RPO:** < 1 second for ledger data
- **P99 Latency (authorization):** < 200ms on happy path
- **Write conflict rate:** < 0.01% of transactions (409 responses)
- **Aurora replication lag P99:** < 1 second in normal operation
- **Automated failover success rate:** 100% in game days (without manual intervention)
- **Additional cost vs. single-region (estimate):** 35–45% (based on AWS public pricing, without negotiation)

> **My Perspective: What I Would Do Differently:** After 16 years building financial systems, the most expensive lesson I've learned is that **active-active architectures are sold as availability solutions, but the real problem is consistency**. Most teams underestimate the second and overestimate the first.
>
> The design proposed here is deliberately conservative on one point: keeping Aurora with a single writer. I know this seems to contradict 'active-active', but for financial data, I prefer to call it 'active-active on routing, active-passive on the ledger'. This is not a design weakness; it is honesty about where consistency is non-negotiable.
>
> If I could change one thing in this design, it would be to **invest earlier in replication observability**. Not just lag metrics, but causal tracing of cross-region transactions: given a payment_id, being able to reconstruct in which region each step was processed, with which version of data, and what the replication path was. That is what saves you at 3am during an incident.
>
> On the rollout: Phase 2 (write canary) is where most projects fail. Teams tend to accelerate the rollout when the first 5% appears stable. My recommendation is to hold each gate for at least one week of peak traffic, not just days. Payments have seasonal patterns (end of month, holidays) that only appear with sufficient observation time.
>
> Finally, the point that rarely appears in architecture documents: **train your on-call team for failover before you need it**. The best-written runbook fails when the person executing it is under pressure for the first time.

## Verdict

This design is technically viable and operationally responsible for a payments API with financial-grade requirements. The combination of Aurora Global Database, DynamoDB Global Tables, and Route 53 ARC represents the state of the art in multi-region resilience on AWS, with explicit and manageable trade-offs.

The critical point any engineer must understand before implementing: **you are not eliminating failures, you are choosing how to fail**. This design chooses to fail detectably (409 on lease conflict, latency degradation during Aurora failover) rather than silently (overdraft, transaction duplication). For financial systems, this choice is the correct one.

The additional cost of 35–45% vs. single-region is real and must be justified with business data: what is the cost of 4 minutes of downtime in terms of lost transactions, regulatory fines, and reputational damage? For most payment processors at scale, the answer makes the investment obvious. For smaller systems, a well-executed active-passive architecture with a 2-minute RTO may be the economically correct choice.

## References

- [AWS Architecture Center](https://aws.amazon.com/architecture/)
- [Amazon Aurora Global Database — Documentation](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-global-database.html)
- [DynamoDB Global Tables — Documentation](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GlobalTables.html)
- [Route 53 Application Recovery Controller](https://docs.aws.amazon.com/r53recovery/latest/dg/what-is-route-53-recovery.html)
- [AWS Global Accelerator — Documentation](https://docs.aws.amazon.com/global-accelerator/latest/dg/what-is-global-accelerator.html)
- [Building Multi-Region Active-Active Architecture — AWS Blog](https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-iv-multi-site-active-active/)
- [AWS Fault Injection Simulator — Documentation](https://docs.aws.amazon.com/fis/latest/userguide/what-is.html)
- [Saga Pattern for Distributed Transactions — AWS Prescriptive Guidance](https://docs.aws.amazon.com/prescriptive-guidance/latest/modernization-data-persistence/saga-pattern.html)

