
Design Doc: Multi-Region Active-Active Payments API

Feb 1, 2026 10 min AI-assisted

This document proposes a multi-region active-active architecture for a critical payments API, targeting near-zero RTO/RPO, deterministic conflict resolution in data replication, and a phased rollout that minimizes operational risk. The design is grounded in real financial engineering principles and AWS patterns, with explicit trade-offs between consistency, latency, and cost.

A payments API that goes down for 4 minutes can cost millions and destroy regulatory trust. This RFC defines how to build a system that survives the total loss of an AWS region without perceptible interruption — and why most naive approaches fail precisely when it matters most.

The Problem: Payments Resilience Is Not Just High Availability

Most systems that self-declare as 'highly available' are, in practice, active-passive with manual or semi-automatic failover. For a payments API, this is insufficient for reasons that go beyond technical SLAs.

First, the regulatory context. Central banks and card schemes (Visa, Mastercard) require RTO measurable in seconds, not minutes. In Brazil, the Central Bank monitors PIX availability with minute-level granularity and fines participants that exceed downtime thresholds. This transforms resilience from an engineering decision into a compliance obligation.

Second, the nature of the data. Payments are not simple idempotent operations. A transfer involves debiting one account, crediting another, recording in a ledger, sending notifications, and frequently integrating with external systems (clearing houses, correspondent banks). If a region fails mid-saga, state can become inconsistent in ways that are hard to detect and dangerous to correct automatically.

Third, the split-brain problem. In a genuine active-active system, two regions can receive concurrent requests for the same resource — for example, two simultaneous withdrawals from the same account processed in different regions before replication propagates the updated balance. Without a robust conflict resolution mechanism, the system may authorize transactions that should be denied, resulting in fraud or overdraft.

This document does not propose 'high availability'. It proposes a system that maintains transactional correctness even under region failure, with minimal added latency on the happy path and predictable degraded behavior on the failure path.

System Fact Sheet

System: Payments API (composite scenario)
Estimated volume: 5,000–15,000 TPS at peak
Target AWS regions: us-east-1 (primary), sa-east-1 (secondary), us-west-2 (tertiary)
Target RTO: < 30 seconds for automatic routing failover (ledger write failover takes longer; see Layer 3)
Target RPO: < 1 second (asynchronous Aurora Global Database replication, typical lag under 1 second)
Primary stack: Amazon Aurora Global Database, DynamoDB Global Tables, Amazon Route 53 ARC, API Gateway, EKS, SQS, EventBridge
Regulatory domain: Instant payments (PIX), cards, international transfers
Target availability: 99.995% (~26 minutes of downtime/year)
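
As a quick check of the availability figure, the downtime budget follows directly from the target percentage. This is a minimal sketch; the function name is illustrative and not part of the design.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float = 365.0) -> float:
    """Minutes of downtime allowed in a period for a given availability target."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Yearly budget for the 99.995% target, then the same target over 30 days.
    print(round(downtime_budget_minutes(99.995), 2))
    print(round(downtime_budget_minutes(99.995, period_days=30), 2))
```

Measured monthly, the same target leaves only about two minutes of budget, which is why a single slow failover can consume a whole month's allowance.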

Goals and Non-Goals

✅ GOAL: Survive total loss of any AWS region without manual intervention and without loss of confirmed transactions
✅ GOAL: Guarantee each financial transaction is processed exactly once (exactly-once semantics) even under inter-region network failures
✅ GOAL: P99 latency < 200ms for payment authorization on the happy path (no active failover)
✅ GOAL: Deterministic and auditable conflict resolution for concurrent writes across regions
✅ GOAL: Phased rollout with rollback capability at each phase without customer impact
❌ NON-GOAL: Replication to regions outside AWS (on-premises or other clouds)

Proposed Design: Architecture and Fundamental Decisions

The design starts from a premise I consider non-negotiable for financial systems: correctness precedes availability. This means I prefer to reject a transaction with a clear error rather than process it incorrectly. With that established, the design is organized into three layers.

Layer 1: Routing and Failover (Route 53 ARC + Global Accelerator)

Amazon Route 53 Application Recovery Controller (ARC) is the central control plane for failover. It maintains continuous readiness checks for each recovery cell (one per region) and allows programmatic failover with a single API call. Global Accelerator routes traffic to the nearest region based on network latency, with endpoint-level health checks every 10 seconds.

The critical decision here is not to use pure DNS-based failover. DNS TTLs, even with low values (60s), combined with intermediate resolver caches, make failover timing unpredictable. Global Accelerator operates at the network layer (Anycast), which eliminates this problem and enables failover in seconds.
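
To make the timing argument concrete, the sketch below compares a rough worst case for DNS-based failover with anycast failover. The 10-second check interval and the 60-second TTL come from the text; the failure threshold of 3 consecutive checks and all function names are assumptions made for illustration, and the model ignores propagation delays inside the routing layer.

```python
def detection_seconds(check_interval_s: float, failure_threshold: int) -> float:
    """Time for health checks to declare an endpoint unhealthy."""
    return check_interval_s * failure_threshold


def dns_failover_seconds(check_interval_s: float, failure_threshold: int,
                         ttl_s: float, resolver_overstay_s: float = 0.0) -> float:
    """Detection, then cached DNS answers must expire before clients move."""
    detection = detection_seconds(check_interval_s, failure_threshold)
    return detection + ttl_s + resolver_overstay_s


def anycast_failover_seconds(check_interval_s: float, failure_threshold: int) -> float:
    """No client-side cache to wait for: traffic shifts once detection completes."""
    return detection_seconds(check_interval_s, failure_threshold)


if __name__ == "__main__":
    print(anycast_failover_seconds(10, 3))        # detection only
    print(dns_failover_seconds(10, 3, ttl_s=60))  # detection plus TTL expiry
```

Under these assumptions detection alone takes 30 seconds, so the 30-second routing target leaves no slack; the threshold is worth tuning during the game days.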

Layer 2: Transaction Processing (EKS + Saga Pattern)

Each region runs an EKS cluster with the payment microservices. The Choreographed Saga pattern is used for transactions involving multiple services, with events published to Amazon EventBridge. Each saga step is idempotent and records its state in DynamoDB Global Tables with conditional writes.

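Idempotency is implemented via an idempotency key in each request header (UUID v4 generated by the client). The authorization service claims the key in DynamoDB with a conditional write (attribute_not_exists) before processing; if the key already exists, it returns the stored result instead of processing again. Doing the check and the insert as one conditional write avoids the race in which two retries both pass a separate existence check. This gives effectively-once processing when the client retries after a timeout, provided the retry reaches the same region or arrives after replication has caught up.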
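
The idempotency flow can be sketched with an in-memory stand-in for the table. Everything here is hypothetical scaffolding: `IdempotencyStore`, `put_if_absent`, and `authorize` are illustrative names, and the lock plays the role that a conditional write would play in DynamoDB.

```python
import threading
from typing import Callable, Dict


class IdempotencyStore:
    """In-memory stand-in for a table that supports put-if-absent."""

    def __init__(self) -> None:
        self._items: Dict[str, dict] = {}
        self._lock = threading.Lock()

    def put_if_absent(self, key: str, item: dict) -> bool:
        """Return True if the key was claimed, False if it already existed."""
        with self._lock:
            if key in self._items:
                return False
            self._items[key] = item
            return True

    def get(self, key: str) -> dict:
        with self._lock:
            return dict(self._items[key])

    def complete(self, key: str, result: dict) -> None:
        with self._lock:
            self._items[key] = {"status": "DONE", "result": result}


def authorize(store: IdempotencyStore, idempotency_key: str,
              process: Callable[[], dict]) -> dict:
    """Process at most once per key; replays return the stored record."""
    if not store.put_if_absent(idempotency_key, {"status": "IN_PROGRESS"}):
        # Either the earlier result or an in-progress marker.
        return store.get(idempotency_key)
    result = process()
    store.complete(idempotency_key, result)
    return store.get(idempotency_key)
```

A retry that arrives while the first attempt is still running sees the `IN_PROGRESS` marker; how the API surfaces that state to the client is a design choice this sketch leaves open.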

Layer 3: Persistence and Replication (Aurora Global + DynamoDB Global Tables)

The financial ledger (balances, confirmed transactions) lives in Amazon Aurora Global Database with a writer instance in us-east-1 and readers in sa-east-1 and us-west-2. Aurora Global's typical replication lag is < 1 second, meeting the defined RPO.

Short-term operational state (idempotency keys, distributed locks, sessions) lives in DynamoDB Global Tables, which uses asynchronous multi-master replication. Here is the central trade-off: DynamoDB Global Tables uses last-writer-wins (LWW) based on timestamp as the default conflict resolution policy. For idempotency keys this is acceptable — the first write wins in practice because the key TTL is long enough. For balances, it is not acceptable — which is why balances live in Aurora, not DynamoDB.
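
The sketch below illustrates why timestamp-based last-writer-wins is unsafe for balances. It is a toy model, not DynamoDB's implementation: the `Write` type, the tie-break by region name, and the 50 ms skew are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    region: str
    value: str
    wall_clock_ms: int  # timestamp as stamped by the writing region's clock


def lww_merge(a: Write, b: Write) -> Write:
    """Keep the write with the larger timestamp; ties broken by region name."""
    return max(a, b, key=lambda w: (w.wall_clock_ms, w.region))


def stamped(region: str, value: str, true_time_ms: int, clock_skew_ms: int) -> Write:
    """Build a write whose timestamp includes the region's clock skew."""
    return Write(region, value, true_time_ms + clock_skew_ms)


if __name__ == "__main__":
    # us-east-1 writes first, but its clock runs 50 ms fast.
    older = stamped("us-east-1", "balance=200", true_time_ms=1000, clock_skew_ms=50)
    # sa-east-1 writes 20 ms later in real time with an accurate clock.
    newer = stamped("sa-east-1", "balance=150", true_time_ms=1020, clock_skew_ms=0)
    print(lww_merge(older, newer))  # the chronologically older write survives
```

For an idempotency record, both writes carry the same outcome, so losing one is harmless; for a balance, the surviving value is simply wrong.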

For the Aurora failover case (promoting reader to writer), Route 53 ARC coordinates the sequence: (1) stop writes in the primary region, (2) wait for zero-lag replication confirmation, (3) promote reader, (4) update endpoints. This process typically takes 60–90 seconds, which only meets the 30-second RTO if step 2 is eliminated — which requires accepting potential sub-second data loss. My recommendation is to keep the real RTO at 90 seconds for database failover and be transparent about this with stakeholders.
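
The four-step sequence can be sketched as a small driver with injected callables. All names are hypothetical; the real calls to stop writes, read replication lag, promote the reader, and update endpoints are not shown, and the polling limits are arbitrary.

```python
import time
from typing import Callable, List


def run_failover(stop_writes: Callable[[], None],
                 replication_lag_ms: Callable[[], float],
                 promote_reader: Callable[[], None],
                 update_endpoints: Callable[[], None],
                 wait_for_zero_lag: bool = True,
                 max_polls: int = 100,
                 poll_interval_s: float = 0.1,
                 sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Run the failover steps in order and return a log of what happened."""
    log: List[str] = []

    stop_writes()                       # step 1
    log.append("writes_stopped")

    if wait_for_zero_lag:               # step 2
        for _ in range(max_polls):
            if replication_lag_ms() == 0:
                log.append("lag_zero")
                break
            sleep(poll_interval_s)
        else:
            raise TimeoutError("replication lag did not reach zero")
    else:
        # Faster, but accepts potential sub-second data loss.
        log.append("lag_check_skipped")

    promote_reader()                    # step 3
    log.append("reader_promoted")

    update_endpoints()                  # step 4
    log.append("endpoints_updated")
    return log
```

Making `wait_for_zero_lag` an explicit flag keeps the RTO versus data-loss trade-off visible in the runbook rather than buried in a script.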

Multi-Region Active-Active Architecture

Flow of a payment transaction in normal operation and behavior during region failover. Each region is an independent recovery cell capable of processing complete transactions.

🌐 Global Layer
  • Client · Mobile/Web
  • Global Accelerator · Anycast routing
  • Route 53 ARC · Readiness + Failover
  • AWS WAF · DDoS + Rate limit
🇺🇸 us-east-1 (Primary Writer)
  • API Gateway · REST + mTLS
  • EKS Cluster · Payment Services
  • Authorization · Service
  • Saga Orchestrator · EventBridge Pipes
  • Aurora Global · Writer (Primary)
  • DynamoDB · Global Tables
  • SQS FIFO · Dead Letter Queue
🇧🇷 sa-east-1 (Secondary)
  • API Gateway · REST + mTLS
  • EKS Cluster · Payment Services
  • Authorization · Service
  • Aurora Global · Reader → Writer*
  • DynamoDB · Global Tables
🔁 Replication & Control
  • Aurora Replication · < 1s lag
  • DynamoDB Replication · Async multi-master
  • CloudWatch · Alarms + Dashboards

Conflict Resolution: The Hardest Problem

Conflicts in active-active payment systems are not theoretical — they happen every time there is replication latency and concurrent traffic. The question is not whether they will occur, but how the system behaves when they do.

Typical conflict scenario: User has a balance of $1,000. Makes two withdrawals of $800 almost simultaneously — one processed in us-east-1, another in sa-east-1, before replication propagates the updated balance. Without protection, both are authorized, resulting in a negative balance of -$600.
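
The scenario can be reproduced in a few lines. This is a toy simulation with illustrative names: each region authorizes against the balance it currently sees, and replication has not yet delivered the other region's debit.

```python
from typing import Dict, List, Tuple


def authorize_withdrawal(balance_seen: int, amount: int) -> bool:
    """Naive check against whatever balance the local region currently sees."""
    return amount <= balance_seen


def simulate_concurrent_withdrawals(initial_balance: int, amount: int,
                                    regions: List[str]) -> Tuple[List[str], int]:
    """Return the regions that approved and the balance after replication converges."""
    # Every region still sees the initial balance: replication has not caught up.
    balance_seen: Dict[str, int] = {region: initial_balance for region in regions}
    approved = [r for r in regions if authorize_withdrawal(balance_seen[r], amount)]
    final_balance = initial_balance - amount * len(approved)
    return approved, final_balance


if __name__ == "__main__":
    print(simulate_concurrent_withdrawals(1000, 800, ["us-east-1", "sa-east-1"]))
    # → (['us-east-1', 'sa-east-1'], -600)
```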

Our approach: Balance Reservation with Distributed Lease

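For transactions involving balance (withdrawals, debit transfers), we implement a distributed lease mechanism using DynamoDB with conditional writes. Before processing, the authorization service attempts to acquire an exclusive lease for the account with a 5-second TTL. Within a region, the conditional write guarantees that only one writer acquires the lease at a time. Across regions, Global Tables evaluates the condition against the local replica, so two regions can briefly both hold a lease until replication converges; the short TTL bounds that window.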

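ConditionExpression: attribute_not_exists(lease_owner) OR lease_expires < :now
UpdateExpression: SET lease_owner = :region, lease_expires = :expires, version = if_not_exists(version, :zero) + :one
(:expires is :now plus 5 seconds, computed by the caller; :zero is 0 and :one is 1, because update expressions take values only through placeholders)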

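If the lease cannot be acquired (another request is processing the account), the request returns HTTP 409 with Retry-After: 1. The client retries; in the normal case the previous holder has released the lease by then. In the worst case, a crashed holder, the lease expires within 5 seconds, so the client may need more than one retry.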
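
The lease logic can be sketched against an in-memory table. This models a single replica only, so it shows the conditional-write semantics but not cross-region replication; `LeaseTable`, `try_acquire`, and `handle_debit` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

LEASE_TTL_S = 5.0


@dataclass
class LeaseRow:
    lease_owner: Optional[str] = None
    lease_expires: float = 0.0
    version: int = 0


class LeaseTable:
    """Single-replica stand-in for a table with conditional updates."""

    def __init__(self) -> None:
        self.rows: Dict[str, LeaseRow] = {}

    def try_acquire(self, account_id: str, region: str, now: float) -> bool:
        row = self.rows.setdefault(account_id, LeaseRow())
        # Mirrors: attribute_not_exists(lease_owner) OR lease_expires < :now
        if row.lease_owner is not None and row.lease_expires >= now:
            return False
        row.lease_owner = region
        row.lease_expires = now + LEASE_TTL_S
        row.version += 1
        return True

    def release(self, account_id: str, region: str) -> None:
        row = self.rows.get(account_id)
        if row is not None and row.lease_owner == region:
            row.lease_owner = None


def handle_debit(table: LeaseTable, account_id: str, region: str,
                 now: float) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers); the caller releases the lease after processing."""
    if not table.try_acquire(account_id, region, now):
        return 409, {"Retry-After": "1"}
    return 200, {}
```

Passing `now` explicitly keeps the sketch deterministic; in a real service the clock source matters, which is the concern raised under clock skew.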

Why not use Two-Phase Commit (2PC)?

2PC guarantees distributed atomicity, but introduces a transaction coordinator that becomes a single point of failure and adds latency proportional to the number of participants. For a payments API with a 200ms P99 SLA, cross-region 2PC is not viable — the round-trip between us-east-1 and sa-east-1 alone is ~120ms.

Why not use Paxos/Raft?

Consensus protocols like Raft guarantee strong consistency, but require majority quorum for each write. With three regions, this means each transaction needs confirmation from at least 2 regions — again, unacceptable latency for the happy path.
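
The latency argument for both rejected options is simple arithmetic. The sketch below uses the ~120 ms round trip and the 200 ms P99 target from the text; modelling 2PC as two sequential round trips and a quorum write as one round trip to the nearest peer is a simplifying assumption, and processing time is left as a parameter.

```python
RTT_MS = 120.0         # us-east-1 <-> sa-east-1 round trip cited in the text
P99_BUDGET_MS = 200.0  # authorization latency target


def two_phase_commit_ms(rtt_ms: float, local_work_ms: float = 0.0) -> float:
    """Prepare and commit each cost one round trip to the remote participant."""
    return 2 * rtt_ms + local_work_ms


def quorum_write_ms(nearest_peer_rtt_ms: float, local_work_ms: float = 0.0) -> float:
    """The leader waits for an acknowledgement from at least one other region."""
    return nearest_peer_rtt_ms + local_work_ms


def remaining_budget_ms(latency_ms: float, budget_ms: float = P99_BUDGET_MS) -> float:
    """Negative values mean the budget is already exceeded."""
    return budget_ms - latency_ms


if __name__ == "__main__":
    print(remaining_budget_ms(two_phase_commit_ms(RTT_MS)))  # → -40.0
    print(remaining_budget_ms(quorum_write_ms(RTT_MS)))      # → 80.0
```

Under this model cross-region 2PC exceeds the budget before any work is done. A quorum write led from sa-east-1, assuming its nearest peer is 120 ms away, leaves 80 ms for TLS, authorization logic, and the database write at P99.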

The pragmatic solution: Separate the data domain by access pattern. Data requiring strong consistency (balances, ledger) lives in Aurora with a single writer at a time. Data tolerating eventual consistency (idempotency keys, audit events) lives in DynamoDB. The distributed lease via DynamoDB is the coordination mechanism — it is eventual, but the short TTL (5s) limits the conflict window to an interval acceptable for the business.

Evaluated Architecture Alternatives

Active-Passive with Aurora Multi-AZ

Pros
  • No write conflicts — single writer always
  • Simple and well-understood operational model
  • Significantly lower cost
Cons
  • RTO of 2–5 minutes for region failover (unacceptable for PIX)
  • Secondary region serves no traffic — wasted capacity
  • Manual or semi-automatic failover with human error risk

Rejected: RTO incompatible with regulatory requirements

Active-Active with CockroachDB (multi-region)

Pros
  • Native global serializable consistency
  • No need for application-level conflict resolution logic
  • Familiar SQL with distributed ACID semantics
Cons
  • Write latency proportional to quorum: ~120–200ms cross-region
  • Not an AWS managed service — increases operational overhead
  • Significantly higher licensing and operational cost

Rejected: Write latency violates 200ms P99 SLA

Active-Active with Aurora Global + DynamoDB (proposed)

Pros
  • Aurora replication < 1s meets financial RPO
  • DynamoDB Global Tables for operational state with millisecond latency
  • Route 53 ARC + Global Accelerator for failover in seconds
  • Managed services reduce operational overhead
Cons
  • Complexity of distributed lease logic in the application
  • Aurora still has single writer — promotion takes 60–90s
  • Cross-region replication cost (egress + storage)

Accepted: Best balance between consistency, latency, and operability

Active-Active with Event Sourcing + Global CQRS

Pros
  • Complete and immutable audit trail by design
  • Conflict resolution via event reordering (causal ordering)
  • Excellent read scalability with regional projections
Cons
  • Very high implementation complexity — long learning curve
  • Eventual consistency for balance reads is problematic for UX
  • Requires complete rewrite of the domain model

Considered for future version — excessive complexity for initial phase

Decision: Data Replication Strategy

Accepted
Context

We need RPO < 1s for financial data, but strong global consistency implies unacceptable write latency. The decision is about how to segment data between persistence mechanisms.

Decision

Ledger data (balances, confirmed transactions) → Aurora Global Database with single writer. Operational data (idempotency keys, leases, sessions) → DynamoDB Global Tables with LWW. Audit events → Kinesis Data Streams, archived to S3 and copied across regions with S3 Cross-Region Replication.

Consequences
  • Aurora writer is a single point of failure for writes — failover takes 60–90s, not 30s
  • DynamoDB LWW is acceptable for idempotency keys because the first write is what matters and TTL is long
  • Estimated additional cost of 35–45% vs. single-region architecture (estimate based on AWS public pricing)

Phased Rollout Plan

    Phase 0 — Foundation (Weeks 1–4)

    Provision multi-region infrastructure via IaC (Terraform). Configure Aurora Global Database with replication to sa-east-1. Configure DynamoDB Global Tables. Implement Route 53 ARC with readiness checks. No production traffic in new regions. Validate RPO via fault injection tests (AWS FIS) in staging environment.

    Phase 1 — Multi-Region Reads (Weeks 5–8)

    Route 100% of reads (balance queries, history) to the region closest to the client using regional Aurora readers. Writes still go 100% to us-east-1. Monitor replication lag and read consistency. KPI: P99 lag < 500ms. Rollback: remove regional read routing, return to us-east-1.

    Phase 2 — Multi-Region Write Canary (Weeks 9–14)

    Enable writes in sa-east-1 for 5% of Brazilian client traffic (selection by feature flag in API Gateway). Activate distributed lease mechanism. Monitor conflicts, lease acquisition latency, and 409 rate. Gradually increase to 25%, 50%, 100% with quality gates at each step. Granular rollback by feature flag without customer impact.

    Phase 3 — Automatic Failover (Weeks 15–18)

    Activate automatic failover via Route 53 ARC based on CloudWatch alarms (error rate > 1% for 60s, P99 latency > 500ms for 120s). Conduct game day: simulate total failure of us-east-1 during low-traffic hours. Measure actual RTO. Document failover and recovery runbook. Train on-call team.

    Phase 4 — Third Region and Steady State (Weeks 19–24)

    Add us-west-2 as third recovery cell. Implement regional circuit breaker in API Gateway. Conduct monthly chaos engineering with AWS FIS. Review and adjust failover thresholds based on real data. Publish external SLA of 99.995% after 30 days of stable operation.
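
The Phase 2 canary needs a stable way to decide which accounts write in sa-east-1 at each percentage. One common approach is hash-based bucketing, sketched below. This is an assumption about how the feature flag could be evaluated, not something the plan specifies; the function names are illustrative.

```python
import hashlib

ROLLOUT_STAGES = (5, 25, 50, 100)  # percentages named in Phase 2


def bucket(account_id: str) -> int:
    """Stable bucket in [0, 100) derived from the account id."""
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def regional_writes_enabled(account_id: str, rollout_percent: int) -> bool:
    """True if this account is inside the current canary percentage."""
    return bucket(account_id) < rollout_percent


if __name__ == "__main__":
    accounts = [f"acct-{i}" for i in range(10_000)]
    for stage in ROLLOUT_STAGES:
        enabled = sum(regional_writes_enabled(a, stage) for a in accounts)
        print(stage, enabled)
```

Because the bucket depends only on the account id, an account enabled at 5% stays enabled at 25% and beyond, and rolling back to a lower percentage removes a predictable subset.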

Critical Risks and Mitigations

Risk 1: Clock Skew Between Regions

DynamoDB LWW uses server timestamps. If there is clock skew between regions (possible even with NTP), the 'last write' may not be the chronologically most recent. Mitigation: use logical versioning (monotonic version number) instead of timestamps for critical leases. Monitor clock drift via CloudWatch.

Risk 2: Retry Cascade During Failover

When a region fails, all clients retry simultaneously to the surviving region, potentially causing a thundering herd. Mitigation: implement exponential backoff with jitter on the client, circuit breaker in API Gateway with per-account rate limiting, and pre-warmed auto-scaling in the secondary region.

Risk 3: Orphan Lease from Instance Crash

If the instance that acquired the lease crashes before releasing it, the lease remains held until TTL expires (5s). During these 5s, the account is blocked for new transactions. Mitigation: 5s TTL is acceptable for the business; implement lease heartbeat for long transactions; monitor lease expiration rate.

Risk 4: Schema Divergence During Deploy

Rolling deploys across multiple regions may have different database schema versions running simultaneously. Mitigation: mandatory backward-compatible migrations (never remove columns, only add); feature flags for new fields; cross-version compatibility tests in the CI pipeline.

Risk 5: Unexpected Egress Cost

Cross-region replication generates egress costs that can be significant at high volume. Estimate: at 10,000 TPS with average payload of 2KB, Aurora replication egress is ~1.7TB/day. Monitor via AW
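
For the retry cascade in Risk 2, client-side exponential backoff with jitter could look like the sketch below. It uses the "full jitter" variant; the base delay, the cap, and the function name are illustrative assumptions rather than values from the design.

```python
import random


def backoff_delay_s(attempt: int, base_s: float = 0.1, cap_s: float = 5.0,
                    rng: random.Random = random.Random()) -> float:
    """Full jitter: a uniform delay in [0, min(cap_s, base_s * 2**attempt)]."""
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    rng = random.Random(42)  # seeded only to make the demo repeatable
    for attempt in range(6):
        print(attempt, round(backoff_delay_s(attempt, rng=rng), 3))
```

Randomizing over the whole interval spreads retries from many clients across time, so the surviving region sees a ramp rather than synchronized waves.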

AWS Well-Architected Assessment

Security

Mandatory mTLS between services; KMS with CMK per region for data at rest; IAM roles with least privilege per microservice; VPC with private subnets and PrivateLink for AWS services; WAF with rate limiting rules per payment account

Reliability

Route 53 ARC for automatic failover; Aurora Global with RPO < 1s; DynamoDB Global Tables multi-master; regional circuit breakers; monthly chaos engineering with AWS FIS

Performance efficiency

Global Accelerator for low-latency routing; regional reads via Aurora readers; DynamoDB for operational state with millisecond latency; auto-scaling based on business metrics (TPS) not just CPU

Sustainability

Workload consolidation on Graviton3 (ARM) instances for better energy efficiency; aggressive auto-scaling to reduce idle capacity; regions with renewable energy mix prioritized when possible

Success Metrics and Targets

Availability: ≥ 99.995% measured monthly
RTO (region failover): < 90 seconds (routing: < 30s; Aurora promotion: < 90s)
RPO: < 1 second for ledger data
P99 Latency (authorization): < 200ms on happy path
Write conflict rate: < 0.01% of transactions (409 responses)
Aurora replication lag P99: < 1 second in normal operation
Automated failover success rate: 100% in game days (without manual intervention)
Additional cost vs. single-region (estimate): 35–45% (based on AWS public pricing, without negotiation)
My Perspective: What I Would Do Differently
Senior Solutions Architect

After 16 years building financial systems, the most expensive lesson I've learned is that active-active architectures are sold as availability solutions, but the real problem is consistency. Most teams underestimate the second and overestimate the first.

The design proposed here is deliberately conservative on one point: keeping Aurora with a single writer. I know this seems to contradict 'active-active', but for financial data, I prefer to call it 'active-active on routing, active-passive on the ledger'. This is not a design weakness; it is honesty about where consistency is non-negotiable.

If I could change one thing in this design, it would be to invest earlier in replication observability. Not just lag metrics, but causal tracing of cross-region transactions: given a payment_id, being able to reconstruct in which region each step was processed, with which version of data, and what the replication path was. That is what saves you at 3am during an incident.

On the rollout: Phase 2 (write canary) is where most projects fail. Teams tend to accelerate the rollout when the first 5% appears stable. My recommendation is to hold each gate for at least one week of peak traffic, not just days. Payments have seasonal patterns (end of month, holidays) that only appear with sufficient observation time.

Finally, the point that rarely appears in architecture documents: train your on-call team for failover before you need it. The best-written runbook fails when the person executing it is under pressure for the first time. Monthly game days are not overhead; they are th

Verdict

This design is technically viable and operationally responsible for a payments API with financial-grade requirements. The combination of Aurora Global Database, DynamoDB Global Tables, and Route 53 ARC represents the state of the art in multi-region resilience on AWS, with explicit and manageable trade-offs.

The critical point any engineer must understand before implementing: you are not eliminating failures, you are choosing how to fail. This design chooses to fail detectably (409 on lease conflict, latency degradation during Aurora failover) rather than silently (overdraft, transaction duplication). For financial systems, this choice is the correct one.

The additional cost of 35–45% vs. single-region is real and must be justified with business data: what is the cost of 4 minutes of downtime in terms of lost transactions, regulatory fines, and reputational damage? For most payment processors at scale, the answer makes the investment obvious. For smaller systems, a well-executed active-passive architecture with a 2-minute RTO may be the economically correct choice.

The 24-week phased rollout may seem conservative, but it is the responsible minimum for a system where errors h

Written with AI assistance from the public case and my architect's reading.