ADR: Replacing SMS OTP with Silent Authentication in Cognito
SMS OTP is simultaneously the most widely deployed authentication mechanism and one of the weakest: vulnerable to SIM swap, SS7 interception, and social engineering, with only ~80% completion rates. This ADR examines the decision to replace or complement SMS OTP with network-silent authentication via Vonage integrated into Amazon Cognito's CUSTOM_AUTH flow.
In financial-grade systems, authentication is not merely an entry gate; it is a measurable attack surface with direct operational cost. When 1 in 5 legitimate users abandons an SMS OTP verification flow and account takeover attacks have grown 141% since 2021, keeping SMS OTP as the primary mechanism needs to be justified as a conscious architectural choice, not an inherited default. This ADR documents the analysis I conducted when evaluating the replacement of the SMS OTP flow with network-silent mobile authentication, integrated into Amazon Cognito via CUSTOM_AUTH, and why, for financial CIAM environments, this is not an obvious decision but one that carries real consequences on each side.
Context and Forces at Play
The system under analysis is a financial CIAM platform with ~4 million monthly active users, operating across multiple Latin American countries. The current authentication flow uses Amazon Cognito with SMS OTP via SNS for second-factor verification. Monthly SMS costs run approximately $18,000–$22,000, of which we estimate 8–12% originates from artificially inflated traffic (AIT) — a conservative number based on SNS log analysis and destination patterns.
The forces driving this review are concrete:
Security pressure: The fraud team identified three successful SIM swap incidents over the past six months, all resulting in complete account takeover. In every case, the SMS OTP was delivered to the attacker because the swap had already occurred before the authentication attempt. Static number lookup databases detected none of the three events.
Conversion pressure: The mobile login flow completion rate sits at 78.3% — below the 80% industry average. Each percentage point of conversion represents approximately $340,000 in annual revenue for this specific product.
Regulatory pressure: LGPD and Banco Central guidelines for Open Finance require authentication mechanisms to be robust against known attacks. SS7 interception is a known, documented attack. Keeping SMS as the sole second factor is becoming questionable from a compliance standpoint.
Support cost pressure: The helpdesk processes approximately 12,000 monthly tickets related to OTP not received or expired — an operational cost that doesn't appear on the security dashboard but is very real.
The Real Cost of SMS OTP at Scale
What 'Network-Powered' Actually Means — and Why It Matters for Financial Fraud
The most important technical distinction in this decision is not about UX — it is about the layer from which the identity signal is derived. Traditional identity verification tools operate on aggregated, cached, or behavioral data. A phone number lookup service queries a static database that may be days or weeks out of date. Device fingerprinting analyzes browser characteristics that can be spoofed. Behavioral biometrics builds models from historical sessions — useful, but a lagging indicator by definition.
What differentiates the mobile network approach is the origin layer of the signal: real-time data directly from mobile network operators (MNOs). When you query whether a SIM was recently swapped, you are querying the network that performed the swap. When Silent Authentication verifies a user, the possession proof is the cellular data session itself — something that cannot be phished, intercepted, or socially engineered because it does not exist as a transmissible secret.
For financial fraud scenarios where SIM swaps are weaponized for account takeover, 'recently' means minutes or hours, not days. Static databases refreshed weekly are not detecting these events — they are logging them after the fact. Real-time operator queries close that window entirely.
This layer difference has a direct architectural implication: the Identity Insights controls (sim_swap, device_swap, subscriber_match, recycled_number, format/network_type) must execute before any verification channel is initiated. This shifts the risk decision point from post-challenge to pre-challenge — which, in cost terms, means fraudulent attempts are blocked before a single SMS is sent, before verification costs are incurred.
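The ordering described above can be sketched as a gate that runs before any billable verification starts. This is a minimal illustration, not the Vonage API: the `insights` dict shape, the `fetch_insights` and `start_verification` callables, and the rule that any positive signal blocks the attempt are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class PreCheckResult:
    allowed: bool
    reason: str


def pre_check(insights: Dict[str, bool]) -> PreCheckResult:
    """Decide from Identity Insights signals whether any challenge may start.

    `insights` is a hypothetical, already-normalised dict such as
    {"sim_swap": False, "recycled_number": False}; it is not the vendor's
    response schema.
    """
    for signal in ("sim_swap", "device_swap", "recycled_number"):
        if insights.get(signal, False):
            return PreCheckResult(allowed=False, reason=signal)
    return PreCheckResult(allowed=True, reason="clean")


def authenticate(phone: str,
                 fetch_insights: Callable[[str], Dict[str, bool]],
                 start_verification: Callable[[str], str]) -> str:
    """Run the risk decision first; only then pay for a verification."""
    result = pre_check(fetch_insights(phone))
    if not result.allowed:
        # No SMS and no Silent Auth request: nothing billable was started.
        return f"blocked:{result.reason}"
    return start_verification(phone)
```

The point of the structure is that `start_verification` is unreachable for a flagged number, so the cost of a blocked attempt is one insights lookup and nothing else.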
Options Considered
Option A: Keep SMS OTP via Cognito + SNS (status quo)
Pros:
- Zero migration effort; already in production
- Native Cognito support without custom Lambda triggers
- Universal coverage; does not depend on cellular data connectivity
Cons:
- Vulnerable to SIM swap, SS7 interception, and social engineering
- ~20% abandonment rate on mobile flows
- No native AIT/SMS pumping detection; fraud cost silently absorbed
- Questionable under Banco Central strong authentication guidelines
Verdict: Unacceptable as status quo for financial CIAM in 2026
Option B: TOTP/Authenticator App as second factor
Pros:
- Native Cognito support (software token MFA)
- Resistant to SIM swap and SS7
- Near-zero per-verification cost after setup
Cons:
- High onboarding friction; requires app installation and linking
- Adoption rate typically 15–30% in B2C user bases
- Does not solve account recovery when device is lost
- No network signal; does not detect SIM swap in real time
Verdict: Adequate as opt-in option for high-value users, not as universal replacement
Option C: Silent Authentication via Vonage + Cognito CUSTOM_AUTH (proposed)
Pros:
- Zero friction for end user on mobile devices with cellular data
- Possession proof based on cellular session; immune to phishing and SS7
- Identity Insights pre-checks block fraud before SMS costs are incurred
- Automatic fallback to SMS/RCS/Voice when cellular data unavailable
- Fraud Defender mitigates AIT/SMS pumping in real time
Cons:
- Third-party dependency (Vonage/Ericsson) in the critical authentication path
- MNO coverage varies by region; requires per-market validation
- Additional complexity in CUSTOM_AUTH Lambda triggers (Define, Create, Verify)
- Higher per-verification cost than raw SMS, but offset by fraud and support reduction
Verdict: Recommended decision for mobile-first flows in markets with good MNO coverage
Option D: Passkeys/WebAuthn as long-term replacement
Pros:
- Phishing resistance by design; no transmissible secret
- Growing platform support on iOS and Android
- Near-zero per-authentication cost
Cons:
- Cognito support still limited; requires custom implementation
- Complex account recovery when device is lost
- User adoption still on learning curve in emerging markets
Verdict: 18–24 month horizon; does not solve the immediate fraud problem
The Decision: CUSTOM_AUTH as an Architectural Extension Surface
The decision is to adopt Option C as the primary flow for mobile authentication, with Option B as an opt-in layer for high-value users and Option D as an 18-month roadmap item. The central technical mechanism is Amazon Cognito's CUSTOM_AUTH flow, which exposes three extension points via Lambda triggers: DefineAuthChallenge, CreateAuthChallenge, and VerifyAuthChallengeResponse.
What makes CUSTOM_AUTH appropriate here is not just flexibility — it is the correct semantics. The flow was designed for external authentication challenges, and Silent Authentication is exactly that: a challenge resolved by a third party (the MNO via Vonage) rather than directly by the user. Cognito manages the session lifecycle, issues JWT tokens with correct claims, and maintains the audit trail — while verification logic lives in the Lambdas.
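To make the role of the first extension point concrete, here is a minimal sketch of a DefineAuthChallenge handler. It uses the standard trigger event fields (`request.session`, `response.issueTokens`, `response.failAuthentication`, `response.challengeName`); the `MAX_ATTEMPTS` limit is an assumption for the example, not a value from this ADR, and the risk policy is deliberately left out.

```python
from typing import Any, Dict

MAX_ATTEMPTS = 3  # assumption: illustrative limit, not taken from the ADR


def define_auth_challenge(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
    """DefineAuthChallenge trigger: decide the next step of the CUSTOM_AUTH flow."""
    session = event["request"].get("session") or []
    response = event["response"]
    last = session[-1] if session else None

    if (last is not None
            and last.get("challengeName") == "CUSTOM_CHALLENGE"
            and last.get("challengeResult") is True):
        # The previous custom challenge was verified: let Cognito issue tokens.
        response["issueTokens"] = True
        response["failAuthentication"] = False
    elif len(session) >= MAX_ATTEMPTS:
        # Too many failed rounds: end the flow without tokens.
        response["issueTokens"] = False
        response["failAuthentication"] = True
    else:
        # First round or a retry: ask Cognito to invoke CreateAuthChallenge.
        response["issueTokens"] = False
        response["failAuthentication"] = False
        response["challengeName"] = "CUSTOM_CHALLENGE"
    return event
```

Cognito calls this function again after every VerifyAuthChallengeResponse round, which is why the decision is driven entirely by the accumulated `session` list rather than by local state.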
Critical Lambda configuration: The CreateAuthChallenge Lambda is where the Vonage Identity Insights API call happens. Its timeout must be raised above the 3s default (10 seconds here), but Cognito itself waits at most 5 seconds for a trigger to respond before retrying, so the outbound Vonage call needs its own client-side timeout comfortably below that limit. The Lambda should also use a VPC with NAT Gateway for outbound calls if security policy requires it, and store Vonage API credentials in AWS Secrets Manager with automatic rotation enabled, never in environment variables. The VerifyAuthChallengeResponse Lambda receives the Silent Authentication result and must implement idempotency via DynamoDB with a 5-minute TTL to prevent replay attacks on the verification token.
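One way to implement the idempotency check is a conditional write that succeeds only the first time a verification token is recorded. The sketch below assumes a table object that behaves like a boto3 DynamoDB `Table` resource; the key name `token_id` and the TTL attribute `expires_at` are hypothetical names chosen for the example.

```python
import time
from typing import Any, Optional

TTL_SECONDS = 300  # the 5-minute window described above


def claim_verification_token(table: Any, token_id: str,
                             now: Optional[float] = None) -> bool:
    """Return True the first time a token is seen, False on any replay.

    `table` is expected to expose put_item(Item=..., ConditionExpression=...)
    like a boto3 DynamoDB Table resource.
    """
    now = time.time() if now is None else now
    try:
        table.put_item(
            Item={"token_id": token_id, "expires_at": int(now) + TTL_SECONDS},
            # Atomic check-and-write: rejected if the token already exists.
            ConditionExpression="attribute_not_exists(token_id)",
        )
        return True
    except Exception as exc:
        # botocore's ClientError carries the error code in `response`.
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code == "ConditionalCheckFailedException":
            return False
        raise
```

DynamoDB removes expired items lazily, so a record can outlive `expires_at`. For replay protection that errs on the safe side: the token simply stays rejected for longer.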
IAM and isolation: Each Lambda trigger must have a separate IAM role with minimum permissions. The CreateAuthChallenge role needs secretsmanager:GetSecretValue scoped to the specific Vonage secret ARN. None of the triggers should have write permissions to the User Pool; the only call that advances the flow is RespondToAuthChallenge, which the client makes through the Cognito SDK, not the Lambdas.
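As an illustration of the scoping described above, the following sketch builds the policy document for the CreateAuthChallenge role. The ARN, region, account ID, and statement ID are placeholders, and the role would still need the usual CloudWatch Logs permissions, which are omitted here.

```python
import json

# Hypothetical ARN: replace with the ARN of the real Vonage secret.
VONAGE_SECRET_ARN = (
    "arn:aws:secretsmanager:sa-east-1:123456789012:secret:vonage-auth-AbCdEf"
)


def create_auth_challenge_policy(secret_arn: str) -> dict:
    """Least-privilege policy document for the CreateAuthChallenge role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadVonageSecretOnly",
                "Effect": "Allow",
                "Action": "secretsmanager:GetSecretValue",
                # A single secret, never "*".
                "Resource": secret_arn,
            }
        ],
    }


print(json.dumps(create_auth_challenge_policy(VONAGE_SECRET_ARN), indent=2))
```

Note what is absent: no `cognito-idp` actions at all, which matches the rule that the triggers never write to the User Pool.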
Silent Authentication Flow: Cognito CUSTOM_AUTH + Vonage
Complete flow from mobile login attempt to JWT token issuance, showing the three CUSTOM_AUTH Lambda triggers, Vonage API calls, and risk decision points.
- Mobile App · iOS / Android
- Amazon Cognito · CUSTOM_AUTH flow
- DefineAuthChallenge · Lambda
- CreateAuthChallenge · Lambda (timeout: 10s)
- VerifyAuthChallengeResponse · Lambda
- Secrets Manager · Vonage API Key
- DynamoDB · Idempotency TTL 5min
- CloudWatch · SLO / Anomaly
- Identity Insights API · sim_swap / format / recycled
- Silent Auth / Verify · Cellular session proof
- Fraud Defender · AIT / SMS pumping block
- Mobile Operator · Real-time SIM data
Technical Consequences: What Changes in Operations
Adopting the CUSTOM_AUTH flow with a third-party dependency in the critical authentication path introduces operational consequences that need to be managed explicitly.
Availability and fallback: Vonage's SLA for the Verify API is documented, but any external dependency in an authentication flow needs a circuit breaker. I implemented this in the CreateAuthChallenge Lambda with a @circuit_breaker decorator used alongside Powertools for AWS Lambda (Powertools itself does not ship a circuit breaker, so the decorator is custom code or a separate library), configured to open after 5 consecutive failures within a 60-second window. When the circuit breaker is open, the Lambda returns a fallback challenge that routes the user to standard SMS OTP via SNS, so that no legitimate user is blocked by Vonage unavailability.
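The breaker behavior described above (open after 5 consecutive failures within 60 seconds, then fall back) can be sketched in a few lines. This is an illustrative in-memory version, not the decorator used in production: the class and function names are hypothetical, and reusing the 60-second window as the cool-down before retrying the provider is an assumption.

```python
import time
from typing import Any, Callable, List, Optional


class CircuitBreaker:
    """Opens after `threshold` consecutive failures inside `window` seconds."""

    def __init__(self, threshold: int = 5, window: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.window = window
        self.clock = clock
        self.failures: List[float] = []  # timestamps of consecutive failures
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.window:
            # Cool-down elapsed: close and let the next call probe the provider.
            self.opened_at = None
            self.failures.clear()
            return False
        return True

    def record_success(self) -> None:
        self.failures.clear()  # a success breaks the consecutive run

    def record_failure(self) -> None:
        now = self.clock()
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            self.opened_at = now


def call_with_fallback(breaker: CircuitBreaker,
                       primary: Callable[[], Any],
                       fallback: Callable[[], Any]) -> Any:
    """Use `primary` unless the breaker is open or the call raises."""
    if breaker.is_open():
        return fallback()
    try:
        result = primary()
    except Exception:
        breaker.record_failure()
        return fallback()
    breaker.record_success()
    return result
```

One caveat with this design in Lambda: the state lives in a single execution environment, so concurrent instances each count their own failures. Sharing the state (for example in DynamoDB) trades that blind spot for an extra dependency on the hot path.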
Observability: The VerifyAuthChallengeResponse Lambda emits custom CloudWatch metrics with the following dimensions: AuthMethod (silent_auth | sms_fallback | blocked_fraud), RiskSignal (sim_swap_detected | clean | recycled_number), and Outcome (success | failure | challenge_expired). This enables creating a silent authentication SLO separate from the general authentication SLO — critical for identifying MNO coverage degradation by region.
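A lightweight way to emit these metrics from a Lambda is the CloudWatch Embedded Metric Format, where a structured JSON log line is turned into a metric. The sketch below builds one such record with the three dimensions listed above; the namespace `Auth/SilentAuth` and metric name `AuthAttempts` are hypothetical.

```python
import json
import time
from typing import Optional

NAMESPACE = "Auth/SilentAuth"  # hypothetical namespace
METRIC_NAME = "AuthAttempts"   # hypothetical metric name


def build_emf_record(auth_method: str, risk_signal: str, outcome: str,
                     timestamp_ms: Optional[int] = None) -> dict:
    """Build one Embedded Metric Format record for an authentication attempt."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    return {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [{
                "Namespace": NAMESPACE,
                # One dimension set combining all three keys.
                "Dimensions": [["AuthMethod", "RiskSignal", "Outcome"]],
                "Metrics": [{"Name": METRIC_NAME, "Unit": "Count"}],
            }],
        },
        "AuthMethod": auth_method,
        "RiskSignal": risk_signal,
        "Outcome": outcome,
        METRIC_NAME: 1,
    }


# Inside Lambda, printing the JSON line to stdout is enough for
# CloudWatch Logs to extract the metric.
print(json.dumps(build_emf_record("silent_auth", "clean", "success")))
```

Keeping the dimension values to the small fixed sets named above matters for cost, since every distinct combination becomes its own metric.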
Latency: Silent Authentication adds ~800ms–1.2s of latency to the authentication flow under normal conditions (real-time MNO query). For the CreateAuthChallenge Lambda, I configured memory at 512MB — above the minimum, because the HTTP call to the Vonage API benefits from additional CPU for TLS handshake. The Lambda timeout at 10s is conservative; observed p99 was 3.2s.
Cost: The incremental per-authentication cost via Silent Auth is higher than raw SMS, but the ROI model changes when you factor in: (a) elimination of 8–12% AIT, (b) ~60% reduction in OTP support tickets, and (c) recovery of 2–5 percentage points of conversion. For 4M MAU with a 3x/month login frequency, the math favors migration.
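The figures quoted in this ADR can be combined into a rough model. The sketch below uses only numbers stated above; the two inputs the ADR does not give (the extra cost per Silent Auth verification and the cost of handling one support ticket) are left as parameters. It also assumes every login goes through Silent Auth, which overstates the extra cost wherever the SMS fallback is used.

```python
MAU = 4_000_000
LOGINS_PER_USER_PER_MONTH = 3
MONTHLY_SMS_COST_USD = (18_000, 22_000)   # low, high
AIT_SHARE_PCT = (8, 12)                   # low, high
OTP_TICKETS_PER_MONTH = 12_000
TICKET_REDUCTION_PCT = 60
REVENUE_PER_CONVERSION_POINT_USD = 340_000
CONVERSION_POINTS_RECOVERED = (2, 5)      # low, high

auths_per_month = MAU * LOGINS_PER_USER_PER_MONTH  # 12,000,000
ait_monthly_usd = tuple(
    cost * pct // 100 for cost, pct in zip(MONTHLY_SMS_COST_USD, AIT_SHARE_PCT)
)  # (1440, 2640)
tickets_avoided_per_month = OTP_TICKETS_PER_MONTH * TICKET_REDUCTION_PCT // 100  # 7200
conversion_gain_usd = tuple(
    pts * REVENUE_PER_CONVERSION_POINT_USD for pts in CONVERSION_POINTS_RECOVERED
)  # (680000, 1700000)


def annual_net_usd(extra_cost_per_auth_usd: float, cost_per_ticket_usd: float):
    """Return (low, high) annual net benefit for the two unknown inputs."""
    extra_cost = auths_per_month * 12 * extra_cost_per_auth_usd
    ticket_savings = tickets_avoided_per_month * 12 * cost_per_ticket_usd
    low = ait_monthly_usd[0] * 12 + conversion_gain_usd[0] + ticket_savings - extra_cost
    high = ait_monthly_usd[1] * 12 + conversion_gain_usd[1] + ticket_savings - extra_cost
    return low, high
```

Even before plugging in the unknowns, the shape is visible: recovered conversion ($680,000 to $1,700,000 per year) dominates the AIT saving ($17,280 to $31,680 per year), so the conversion estimate is the number most worth validating in a pilot.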
Consequences and Risks Requiring Explicit Mitigation
1. MNO coverage is not uniform. Silent Authentication requires the user's device to be on an active cellular data session at authentication time. Wi-Fi, VPN, and some MVNOs may prevent verification. Validate coverage per carrier and per country before rollout — do not assume your largest market's coverage represents the others.
2. CUSTOM_AUTH does not natively support refresh token rotation. If you use refresh token rotation in Cognito, the CUSTOM_AUTH flow does not automatically inherit that behavior. You will need to implement token revocation logic explicitly in the DefineAuthChallenge Lambda, querying a DynamoDB table of revoked tokens.
3. Third-party dependency in the critical path. Vonage/Ericsson is an AWS partner, but it is still an external system. The circuit breaker is mandatory, not optional. Test the SMS fallback in your gameday runbook — do not discover it does not work during a production incident.
4. Mobile network data is sensitive personal data. Under LGPD, sim_swap and subscriber_match queries involve telecommunications data that may require a specific legal basis and RIPD registration. Involve the DPO before go-live.
Governance, Compliance, and the Residual Threat Model
No authentication decision eliminates risk — it redistributes the threat model. With SMS OTP, the primary vector is the attacker who controls the destination number (SIM swap, SS7). With Silent Authentication, the residual vector is the attacker who compromises the physical device — a significantly higher bar, but not zero.
What the residual threat model looks like: An attacker with physical access to an unlocked device can complete a Silent Authentication. This is true for any device-possession-based mechanism, including TOTP and passkeys. The difference is that Silent Authentication has no secret that can be phished remotely — the attack requires physical presence or OS-level device compromise.
WAF and threat intelligence integration: The CreateAuthChallenge Lambda should log the source IP, user-agent, and device characteristics to CloudWatch Logs with structured logging (JSON). An AWS WAF rule group on the API Gateway preceding Cognito can block IPs with fraud history before the request reaches the CUSTOM_AUTH flow — reducing the cost of Vonage API calls for clearly malicious attempts.
Audit and non-repudiation: Each successful authentication should generate a CloudTrail event with the AuthMethod used. For financial compliance purposes (PCI-DSS, SOC 2), recording which mechanism authenticated each session is necessary for fraud investigations. Configure CloudTrail with S3 Object Lock in COMPLIANCE mode to guarantee immutability of authentication logs for at least 7 years.
Tiered risk policy: The Identity Insights pre-check output should feed a risk policy with three outputs: (1) clean → Silent Auth, (2) elevated risk (e.g., sim_swap in last 24h) → step-up with TOTP or biometrics, (3) high risk (e.g., VoIP number + recent sim_swap) → hard block with fraud team notification. This policy lives in the DefineAuthChallenge Lambda and must be versioned as code.
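The three-tier policy can be kept as a small pure function, which makes it easy to version and unit test with the fraud team. This is a sketch under stated assumptions: the field names `sim_swap_last_24h` and `network_type` are hypothetical normalised inputs, not the vendor's response schema, and the thresholds mirror only the examples given above.

```python
from enum import Enum
from typing import Any, Dict

POLICY_VERSION = "2026-01-draft"  # hypothetical version label


class Decision(Enum):
    SILENT_AUTH = "silent_auth"  # tier 1: clean
    STEP_UP = "step_up"          # tier 2: elevated risk
    HARD_BLOCK = "hard_block"    # tier 3: high risk, notify fraud team


def evaluate_risk(signals: Dict[str, Any]) -> Decision:
    """Map Identity Insights pre-check signals to one of three outcomes."""
    recent_swap = bool(signals.get("sim_swap_last_24h", False))
    is_voip = signals.get("network_type") == "voip"

    if recent_swap and is_voip:
        return Decision.HARD_BLOCK
    if recent_swap:
        return Decision.STEP_UP
    return Decision.SILENT_AUTH
```

For example, `evaluate_risk({"sim_swap_last_24h": True, "network_type": "mobile"})` yields `Decision.STEP_UP`. Because the function has no I/O, the fraud team can review the rules and their test cases without reading any Lambda plumbing.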
AWS Well-Architected Framework Analysis
Security
The CUSTOM_AUTH flow with pre-verification Identity Insights raises the authentication assurance level. Separate IAM roles per Lambda trigger, credentials in Secrets Manager with rotation, and KMS encryption at rest for the idempotency DynamoDB table are non-negotiable requirements. WAF in front of Cognito adds defense in depth.
Reliability
The circuit breaker in the CreateAuthChallenge Lambda with SMS OTP fallback ensures authentication continues functioning even during Vonage unavailability. The authentication SLO must be monitored separately by method (silent vs. fallback) to detect MNO coverage degradation before it affects users at scale.
Anti-Patterns to Avoid in This Implementation
- Storing Vonage API credentials in Lambda environment variables; use Secrets Manager with automatic rotation and an IAM condition on secretsmanager:ResourceTag/Service: vonage-auth
- Implementing CUSTOM_AUTH without a circuit breaker; any Vonage latency or unavailability becomes an authentication outage for all mobile users
- Using a single Lambda for all three CUSTOM_AUTH triggers; this violates single responsibility, forces one shared IAM role, and makes debugging authentication flows considerably harder
- Not implementing idempotency in VerifyAuthChallengeResponse — verification tokens can be reused in replay attacks if there is no invalidation TTL
- Global rollout without per-market MNO coverage validation — SMS fallback rate may be much higher than expected in markets with dominant MVNOs
- Ignoring the data privacy dimension — sim_swap and subscriber_match queries are sensitive telecommunications data requiring explicit legal basis under LGPD/GDPR
In financial implementations I've done with CUSTOM_AUTH, the most expensive mistake was not technical: it was failing to test the fallback under production conditions before go-live. The circuit breaker works in staging, but Cognito's behavior when the Lambda returns an unexpected error is different from when it returns an explicit fallback challenge. Test both paths. Second point: the risk policy in DefineAuthChallenge needs to be treated as critical business code, not glue code. Version it, test it, and review it with the fraud team, not just engineering. The hardest lesson I learned is that the MNO coverage the vendor presents in the sales deck is best-case coverage; real coverage in your actual user mix can be 15–20 percentage points lower, especially in markets with high MVNO penetration or many roaming users.
Final Decision and Recommendation
For mobile-first financial CIAM platforms with exposure to SIM swap, AIT, and conversion pressure, migrating from SMS OTP to Silent Authentication via Vonage + Cognito CUSTOM_AUTH is the correct decision — but with conditions. Adopt if: (a) your user base is predominantly mobile with cellular data, (b) you have validated MNO coverage in your primary markets, (c) you implement a circuit breaker with SMS fallback, and (d) you treat the risk policy in DefineAuthChallenge as versioned business code. Do not adopt as a complete SMS replacement if your base has significant Wi-Fi-only or MVNO penetration without API coverage. The residual threat model is acceptable for the assurance level required by Open Finance and PCI-DSS level 2. ROI is positive in 6–9 months for bases above 1M MAU. Passkeys/WebAuthn is the long-term destination — but Silent Authentication is the best available solution today for the mobile authentication problem at financial scale.