# Design Doc: RDS Proxy for Lambda + RDS Without Melting the Database

Lambda functions under high concurrency open hundreds of direct connections to RDS, exhausting the pool and crashing the database. This document proposes RDS Proxy as a multiplexing layer, details real pinning pitfalls, compares alternatives such as Data API and application-side poolers, and defines when Proxy is not the right answer.

- URL: https://fernando.moretes.com/studies/design-doc-rds-proxy-pooling-serverless

- Markdown: https://fernando.moretes.com/studies/design-doc-rds-proxy-pooling-serverless/study.md?lang=en

- Type: Design Doc / RFC

- Company: Serverless + RDS (scenario)

- Domain: Data

- Date: 2026-02-15

- Tags: rds-proxy, lambda, serverless, connection-pooling, aws-rds, data-platform, iam-auth, postgresql

- Reading time: 12 min

---

Every Lambda invocation opens its own database connection. In production, that is not a detail — it is a time bomb. This RFC defines the correct architecture for serverless workloads on RDS, with honesty about costs, pitfalls, and the scenarios where RDS Proxy does not solve the problem.

## The Problem: Connection Storms in Serverless Environments

Lambda's execution model is fundamentally incompatible with the traditional connection model of relational databases. A conventional application process — whether a container or a VM — initializes a connection pool once at startup and reuses those connections throughout its entire lifetime. The database sizes its `max_connections` based on the number of application instances, which grows in a predictable and controlled manner.

Lambda breaks that assumption. Each execution environment (the isolated sandbox running a function) handles one invocation at a time and typically holds its own connection to the database. When concurrency scales from 10 to 500 simultaneous invocations in a matter of seconds (completely normal during traffic spikes or when draining a deep SQS backlog), the database receives 490 new connections in a burst. This is what the industry calls a *connection storm*.
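The one-connection-per-environment behavior can be sketched as follows. This is a minimal illustration, not the author's code: `connect` is a hypothetical factory standing in for a real driver call (for example a `psycopg2.connect` wrapper), and the handler shape is simplified.

```python
from typing import Any, Callable, Optional

# Module scope survives across warm invocations of the same execution environment.
_conn: Optional[Any] = None


def get_connection(connect: Callable[[], Any]) -> Any:
    """Return the cached connection, opening one only when none is usable."""
    global _conn
    # Drivers such as psycopg2 expose a truthy `closed` attribute once a connection is closed.
    if _conn is None or getattr(_conn, "closed", False):
        _conn = connect()
    return _conn


def make_handler(connect: Callable[[], Any]) -> Callable[[dict, Any], dict]:
    """Build a Lambda-style handler bound to a hypothetical `connect` factory."""
    def handler(event: dict, context: Any) -> dict:
        conn = get_connection(connect)
        return {"connection_id": id(conn)}
    return handler
```

Reuse within one environment only removes per-invocation reconnects. It does nothing about the storm itself, because 500 concurrent environments still mean 500 separate connections.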

PostgreSQL, for example, allocates memory per connection (typically 5–10 MB per worker process). A `db.t3.medium` instance has 4 GiB of RAM, and RDS derives the default `max_connections` from instance memory, which works out to roughly 400 for this class. With Lambda at high concurrency, that limit is hit in seconds. The database starts rejecting new connections with `FATAL: remaining connection slots are reserved`, legitimate queries fail, and Lambda's retry cycle worsens the problem: each retry attempts to open yet another connection against an already saturated database.
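The arithmetic behind that limit can be sketched as below. The formula `LEAST(DBInstanceClassMemory / 9531392, 5000)` is assumed here as the RDS PostgreSQL default. `DBInstanceClassMemory` is smaller than the nominal RAM because the OS and RDS processes reserve part of it, so the results are upper bounds rather than the values a real instance reports.

```python
GIB = 1024 ** 3


def default_max_connections(instance_memory_bytes: int) -> int:
    """Upper bound of the assumed RDS PostgreSQL default for max_connections."""
    return min(instance_memory_bytes // 9531392, 5000)


def connection_memory_mb(connections: int, mb_per_connection: float) -> float:
    """Rough memory held by backend processes for a given connection count."""
    return connections * mb_per_connection


for label, memory in (("2 GiB", 2 * GIB), ("4 GiB", 4 * GIB)):
    limit = default_max_connections(memory)
    low = connection_memory_mb(limit, 5)
    high = connection_memory_mb(limit, 10)
    print(f"{label}: max_connections <= {limit}, ~{low:.0f}-{high:.0f} MB if all slots are used")
```

Even at the optimistic end of the 5–10 MB range, filling every slot consumes a large share of the instance's memory, which is why raising `max_connections` alone is rarely a durable fix.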

The problem repeats during cold starts: when a new execution environment is initialized, it must complete the TCP handshake, the TLS handshake, and the database authentication protocol before executing the first query. On PostgreSQL with mandatory SSL, that overhead can add 50–150ms of connection setup alone, appended to the latency perceived by the end user. During failover events — when RDS promotes a replica to primary — all existing connections are dropped simultaneously, and Lambda attempts to reconnect en masse, creating a second storm at precisely the moment the database is most vulnerable.
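One common way to soften the retry and reconnect storms described above is exponential backoff with full jitter. The source does not prescribe this, so treat the sketch as one possible mitigation. `connect` is a hypothetical factory, and real drivers raise their own exception types, which would be passed through `retry_on`.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def connect_with_backoff(
    connect: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Retry `connect`, spreading retries out so clients do not reconnect in lockstep."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return connect()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the driver's error
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * cap)  # full jitter: wait a random time in [0, cap)
    raise AssertionError("unreachable")
```

The jitter matters more than the backoff: after a failover every environment loses its connection at the same instant, and a fixed delay would only shift the synchronized burst later.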

## Proposed Solution: RDS Proxy as a Multiplexing Layer

RDS Proxy is an AWS-managed proxy that sits between clients (Lambda, ECS, EC2) and the RDS instance. It maintains a pool of persistent, authenticated connections to the database and multiplexes client connections over that smaller pool. From the database's perspective, it sees only the Proxy's connections — a fixed, controlled number. From Lambda's perspective, it sees an endpoint that accepts connections as if it were the database directly.

The central mechanism is *connection multiplexing*: when a Lambda invocation finishes its transaction and releases the connection, the Proxy does not close that connection to the database — it returns it to the pool and reuses it for the next invocation that needs it. This drastically reduces the number of active connections on the database. In a scenario with 500 simultaneous Lambdas, the database may see only 20–50 connections from the Proxy, depending on transaction duration and access patterns.
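A back-of-the-envelope estimate shows where a figure like 20–50 can come from. The sketch assumes the Proxy needs a backend connection only while a client is inside a transaction, so the expected number of busy connections is concurrency multiplied by the fraction of each invocation spent in a transaction. The timings are hypothetical inputs, not measurements.

```python
import math


def estimated_backend_connections(
    concurrent_invocations: int,
    invocation_ms: int,
    in_transaction_ms: int,
) -> int:
    """Expected busy backend connections under ideal multiplexing (no pinning)."""
    if invocation_ms <= 0 or not 0 <= in_transaction_ms <= invocation_ms:
        raise ValueError("need 0 <= in_transaction_ms <= invocation_ms and invocation_ms > 0")
    return math.ceil(concurrent_invocations * in_transaction_ms / invocation_ms)


# 500 concurrent invocations, 200 ms each, 15 ms of which is spent inside a transaction.
print(estimated_backend_connections(500, 200, 15))
# A fully pinned workload holds its connection for the whole invocation.
print(estimated_backend_connections(500, 200, 200))
```

This is an average, not a ceiling. Bursts and uneven transaction durations push the real peak higher, so pool limits should leave headroom above the estimate.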

Beyond pooling, the Proxy offers two critical operational benefits: **accelerated failover** and **centralized authentication**. On failover, the Proxy detects the primary instance's unavailability and redirects connections to the new primary in approximately 20–30 seconds — significantly faster than the time it would take for DNS to propagate the change and for all Lambda instances to reconnect individually. The Proxy's pool connections to the database are re-established internally, and clients using the Proxy endpoint experience a much shorter interruption.

For authentication, the Proxy supports IAM Authentication: instead of storing database credentials in code or environment variables, the Lambda function uses its IAM execution role to obtain a temporary token (valid for 15 minutes) via `generate_db_auth_token`. The Proxy validates that token against IAM and authenticates its own connections to the database using credentials stored in Secrets Manager. Secrets Manager rotates those credentials automatically, and the Proxy picks up the new values without interrupting existing pool connections. This eliminates the entire class of hardcoded credential vulnerabilities and simplifies password rotation in production.
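A minimal sketch of the token step, assuming boto3's RDS client and libpq-style connection parameters such as those accepted by `psycopg2.connect`. The function and parameter names are illustrative. The client is passed in so the helper can be exercised without AWS access.

```python
from typing import Any, Dict


def build_proxy_connect_kwargs(
    rds_client: Any,
    proxy_host: str,
    port: int,
    db_user: str,
    db_name: str,
    region: str,
) -> Dict[str, Any]:
    """Create a short-lived IAM auth token and return driver connection parameters."""
    token = rds_client.generate_db_auth_token(
        DBHostname=proxy_host, Port=port, DBUsername=db_user, Region=region
    )
    return {
        "host": proxy_host,
        "port": port,
        "user": db_user,
        "dbname": db_name,
        "password": token,     # the token takes the place of the database password
        "sslmode": "require",  # IAM authentication to the Proxy requires TLS
    }


def make_rds_client(region: str) -> Any:
    """Build the real client; boto3 is imported lazily so the module loads without it."""
    import boto3
    return boto3.client("rds", region_name=region)
```

In a function this would be used roughly as `psycopg2.connect(**build_proxy_connect_kwargs(make_rds_client(region), host, 5432, "app_user", "appdb", region))`, generating a fresh token whenever a new connection is opened, since each token expires after 15 minutes.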

The minimum viable configuration involves: creating the Proxy in the same VPC as the RDS instance, associating a Secrets Manager secret with the database credentials, creating an IAM Policy that allows `rds-db:connect` for the Proxy ARN, and changing the Lambda application's endpoint from `rds-instance-endpoint` to `proxy-endpoint`. The Proxy is transparent to the database driver: with password authentication only the endpoint changes, while IAM authentication additionally requires the function to generate the token and pass it as the password over TLS.
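The IAM policy from that list can be sketched as a plain document. The account ID, proxy resource ID (`prx-...`), and user name below are placeholders. The resource ARN shape `arn:aws:rds-db:<region>:<account>:dbuser:<proxy-resource-id>/<db-user>` is the form assumed here for Proxy connections.

```python
import json
from typing import Any, Dict


def rds_connect_policy(region: str, account_id: str, proxy_resource_id: str, db_user: str) -> Dict[str, Any]:
    """Least-privilege policy: one action, one proxy, one database user."""
    resource = f"arn:aws:rds-db:{region}:{account_id}:dbuser:{proxy_resource_id}/{db_user}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "rds-db:connect", "Resource": resource}
        ],
    }


# Placeholder identifiers for illustration only.
policy = rds_connect_policy("us-east-1", "123456789012", "prx-0123456789abcdef0", "app_user")
print(json.dumps(policy, indent=2))
```

Scoping the resource to a single proxy and user, rather than a wildcard, keeps the execution role from connecting as any other database user.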

## Architecture: Lambda + RDS Proxy + RDS (connection flow and failover)

Multiplexed connection flow between concurrent Lambda invocations and the RDS database via Proxy, including IAM authentication, Secrets Manager, and Multi-AZ failover behavior.

### ⚡ Lambda Tier

- Lambda Fn env #1 (compute)
- Lambda Fn env #2 (compute)
- Lambda Fn env #N (concurrent) (compute)

### 🔐 Auth / Secrets

- IAM rds-db:connect token (security)
- Secrets Manager DB credentials (auto-rotation) (security)

### 🔀 Proxy Layer

- RDS Proxy Connection Pool (multiplexing) (network)

### 🗄️ RDS Multi-AZ

- RDS Primary (AZ-a) (data)
- RDS Standby (AZ-b) [failover target] (data)

### Flows

- lambda1 -> iam: generate IAM token
- lambda2 -> iam: generate IAM token
- lambdaN -> iam: generate IAM token
- lambda1 -> rdsproxy: TCP/TLS connection
- lambda2 -> rdsproxy: TCP/TLS connection
- lambdaN -> rdsproxy: TCP/TLS connection
- rdsproxy -> secretsmanager: fetch credentials
- rdsproxy -> rds_primary: persistent pool (N conns << Lambda concurrency)
- rds_primary -> rds_standby: synchronous replication
- rdsproxy -> rds_standby: automatic failover (~20-30s)

## Real Pitfalls: Pinning and Other Counterintuitive Behaviors

RDS Proxy is not a silver bullet. The multiplexing benefit depends on a critical premise: that the Proxy can reuse connections between different clients. When it cannot — when a connection is stuck to a specific client — this is called **pinning**, and it is the primary source of frustration with the Proxy in production.

The Proxy pins a connection when it detects that the session state has been modified in a way that cannot be shared with another client. The most common cases are:

**1. Open or long transactions:** Any `BEGIN` without a corresponding `COMMIT`/`ROLLBACK` keeps the connection pinned until the transaction ends. This is expected and correct — the Proxy cannot reuse a connection in the middle of a transaction. The problem occurs when the application uses long transactions (batch processing, multi-step operations) or, worse, when the ORM opens a transaction implicitly and does not close it correctly on error.

**2. Session state with `SET` statements:** Commands like `SET search_path`, `SET TIME ZONE`, `SET LOCAL`, MySQL session variables (`SET @variable`), and `pg_catalog` settings modify the connection state. The Proxy detects these commands and pins the connection, because it cannot guarantee that another client receives a connection with the same state. This is especially problematic with ORMs that emit `SET` statements automatically during session initialization.

**3. Prepared statements:** In PostgreSQL, named prepared statements (`PREPARE stmt AS ...`) are session state and cause pinning. Unnamed prepared statements (the default for most modern drivers) are handled differently and generally do not cause pinning.

**4. Functions and procedures that alter session state:** Any stored procedure that executes `SET` internally can cause pinning without the application code being directly responsible.

The practical consequence of pinning is that the number of connections on the database approaches the number of concurrent Lambdas — exactly the problem the Proxy was supposed to solve. For diagnosis, CloudWatch exposes the `DatabaseConnectionsCurrentlySessionPinned` metric. If this metric is consistently high (above 20–30% of connections), the Proxy's benefit is being wasted.
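The 20–30% rule of thumb can be expressed as a small check. The inputs are assumed to be datapoints read from the CloudWatch metrics `DatabaseConnectionsCurrentlySessionPinned` and `DatabaseConnections`. The thresholds are the ones used in this document, not AWS defaults.

```python
def pinned_ratio(pinned: float, database_connections: float) -> float:
    """Share of the Proxy's database connections that are pinned to one client."""
    if database_connections <= 0:
        return 0.0
    return pinned / database_connections


def classify_pinning(ratio: float, warn: float = 0.20, alarm: float = 0.30) -> str:
    """Map a pinned ratio onto the document's thresholds."""
    if ratio > alarm:
        return "alarm"
    if ratio >= warn:
        return "warn"
    return "ok"


for pinned, total in ((5, 50), (12, 48), (20, 40)):
    ratio = pinned_ratio(pinned, total)
    print(f"{pinned}/{total} pinned -> {ratio:.0%} -> {classify_pinning(ratio)}")
```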

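The solution to pinning involves: (1) ensuring transactions are short and always explicitly closed, (2) moving session configuration to the connection string or to the database/role level instead of runtime `SET` statements, (3) auditing the ORM to understand which statements it emits automatically, and (4) not counting on a pooling-mode switch. Unlike PgBouncer, RDS Proxy has no `TRANSACTION`/`SESSION` mode setting and always multiplexes at transaction boundaries. The remaining knobs are the target group's session pinning filters (for example `EXCLUDE_VARIABLE_SETS`) and its initialization query, which are safe only when the access pattern is truly transactional and stateless.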
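The ORM audit in step (3) can start from a statement log. The sketch below is a rough heuristic, not a complete list of pinning conditions: it flags only the statement kinds named in this section, and the input is assumed to be SQL captured from driver or database logs.

```python
import re
from typing import Dict, Iterable, List

# Heuristic patterns for the statement kinds this section names as pinning causes.
PINNING_PATTERNS = {
    "session SET": re.compile(r"^\s*SET\b", re.IGNORECASE),
    "named PREPARE": re.compile(r"^\s*PREPARE\b", re.IGNORECASE),
    "LISTEN": re.compile(r"^\s*LISTEN\b", re.IGNORECASE),
}


def find_pinning_candidates(statements: Iterable[str]) -> Dict[str, List[str]]:
    """Group logged statements by the pinning cause they may trigger."""
    findings: Dict[str, List[str]] = {name: [] for name in PINNING_PATTERNS}
    for sql in statements:
        for name, pattern in PINNING_PATTERNS.items():
            if pattern.match(sql):
                findings[name].append(sql.strip())
    return findings


log = [
    "set search_path to app",
    "SELECT * FROM settings WHERE id = 1",
    "PREPARE get_user AS SELECT * FROM users WHERE id = $1",
    "  LISTEN order_events",
]
for cause, hits in find_pinning_candidates(log).items():
    print(cause, hits)
```

Statements issued inside stored procedures will not show up in a client-side log, so a clean result here does not rule out the fourth cause listed above.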

There is also a less obvious limitation: RDS Proxy **does not support all database features**. For PostgreSQL, there is no support for `pg_notify`/`LISTEN`, logical replication, or connections via `pg_hba.conf` with unsupported authentication methods. For MySQL, there are restrictions on multi-statements and some administrative commands. Before adopting the Proxy on an existing system, it is necessary to audit the use of these features.

## Goals and Non-Goals of this RFC

- ✅ GOAL: Eliminate connection storms in Lambda workloads with high concurrency over RDS PostgreSQL or MySQL
- ✅ GOAL: Reduce the number of active database connections to a fixed, controlled value regardless of Lambda concurrency
- ✅ GOAL: Implement authentication without hardcoded credentials via IAM Auth + Secrets Manager with automatic rotation
- ✅ GOAL: Reduce the impact of Multi-AZ failovers on running Lambda functions
- ✅ GOAL: Define when RDS Proxy is NOT the appropriate solution and which alternatives to use
- ❌ NON-GOAL: Replacing RDS with Aurora Serverless v2 or DynamoDB — that decision is out of scope

## Quick Reference: Scenario and Stack

- **Scenario:** Serverless (Lambda) + RDS — common pattern in APIs, event processing, lightweight ETL
- **Supported databases:** RDS MySQL 5.6/5.7/8.0, RDS PostgreSQL 10.x–16.x, Aurora MySQL, Aurora PostgreSQL
- **Typical Lambda concurrency (problem):** 100–1000 simultaneous invocations opening direct connections to RDS
- **Typical max_connections (db.t3.medium PostgreSQL):** ~400 connections by default (derived from instance memory; ~5-10 MB/connection)
- **RDS Proxy cost:** $0.015/vCPU-hour of the associated RDS instance (estimate: ~$22/month for a 2-vCPU db.t3.medium at 730 hours; pricing varies by region)
- **Database connection reduction (estimate):** 60–90% fewer active database connections vs. direct Lambda (without pinning)
- **Failover with Proxy vs. without Proxy:** ~20–30s (Proxy) vs. ~60–120s (DNS propagation + reconnect storm)
- **Authentication stack:** IAM Role → generate_db_auth_token (15min token) → RDS Proxy → Secrets Manager → RDS
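The cost line above is simple arithmetic, sketched here under stated assumptions: the $0.015/vCPU-hour rate quoted in this document, 730 hours per month, 2 vCPUs for a db.t3.medium, and a 2-vCPU minimum charge. Check these against current regional pricing before using the numbers.

```python
HOURS_PER_MONTH = 730


def proxy_monthly_cost(vcpus: int, price_per_vcpu_hour: float = 0.015, min_vcpus: int = 2) -> float:
    """Estimated monthly RDS Proxy charge for a provisioned instance."""
    billed_vcpus = max(vcpus, min_vcpus)
    return round(billed_vcpus * price_per_vcpu_hour * HOURS_PER_MONTH, 2)


for instance, vcpus in (("db.t3.medium", 2), ("8-vCPU instance", 8)):
    print(f"{instance}: ~${proxy_monthly_cost(vcpus):.2f}/month")
```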

## Alternatives: RDS Proxy vs. Data API vs. Application-Side Pooler

### RDS Proxy (proposed)

**Pros**
- Transparent multiplexing: no query or driver changes (IAM Auth adds only token generation at connect time)
- IAM Auth + automatic credential rotation via Secrets Manager
- Accelerated failover (~20–30s) with reconnection managed by the Proxy
- Fully managed — no additional infrastructure operation

**Cons**
- Additional cost (~$0.015/vCPU-hour) — not justified for low-concurrency workloads
- Pinning can negate the benefit if the application uses session state or long transactions
- Additional ~1ms latency per query (proxy hop overhead)
- Does not support all database features (LISTEN/NOTIFY, logical replication)

**Verdict:** Correct choice for Lambda with high concurrency, short transactions, and stateless pattern

### RDS Data API (Aurora Serverless v1/v2)

**Pros**
- HTTP/HTTPS — no persistent TCP connection, completely serverless
- No pool management or pinning — each call is stateless by design
- Ideal for Lambda with extremely high concurrency and simple queries

**Cons**
- Available only for Aurora clusters; does not work with standard RDS instances
- Significantly higher latency per query (~5–20ms overhead vs. ~1ms for Proxy)
- No support for long transactions or streaming of large result sets
- Aurora Serverless v2 cost may be higher than RDS + Proxy for stable workloads

**Verdict:** Valid if already using Aurora Serverless and the pattern is simple, stateless queries. Do not migrate RDS to Aurora just for this.

### Application-Side Pooler (PgBouncer / HikariCP via Lambda Layer)

**Pros**
- Full control over pool configuration (timeouts, size, pooling mode)
- No additional infrastructure cost beyond Lambda
- PgBouncer in transaction pooling mode can be more efficient than RDS Proxy in specific scenarios

**Cons**
- Pool per Lambda execution environment — each sandbox has its own pool, no sharing between parallel invocations
- Does not solve the fundamental problem: 500 Lambdas = 500 independent pools
- PgBouncer as sidecar requires additional infrastructure (ECS, EC2) — not serverless
- Operation and maintenance of proxy infrastructure falls on the team

**Verdict:** Does not solve the connection storm in Lambda. Valid only if the pooler is external (not in Lambda) and managed as a dedicated service.

## Decision: Adopt RDS Proxy with IAM Auth for Lambda + RDS workloads

**Status:** proposed

**Context**

Lambda workload with concurrency > 50 simultaneous invocations accessing RDS PostgreSQL/MySQL. Direct connections exhaust max_connections during peaks. Database credentials stored in environment variables without automatic rotation.

**Decision**

Introduce RDS Proxy between Lambda and RDS. Configure IAM Authentication on the Proxy. Migrate credentials to Secrets Manager with 30-day automatic rotation. Monitor DatabaseConnectionsCurrentlySessionPinned in CloudWatch.

**Consequences**
- Additional cost of ~$0.015/vCPU-hour of RDS — justified by eliminating connection storm outages
- Need to audit the ORM to identify SET statements and implicit transactions that cause pinning
- Lambda needs explicit IAM permission (rds-db:connect) and network access to the Proxy in the same VPC
- Application endpoint changes from direct RDS to Proxy endpoint — configuration change, not code change

## Rollout Plan

1. **Phase 0 — Audit (Week 1)** — Map all Lambda functions accessing RDS. Identify ORMs and drivers in use. Audit automatic SET statements, long transaction usage, LISTEN/NOTIFY, and named prepared statements. Measure current max_connections and peak DatabaseConnections in CloudWatch. Define P50/P99 query latency baseline.

2. **Phase 1 — Proxy Infrastructure (Week 2)** — Create secret in Secrets Manager with database credentials. Create RDS Proxy in the same VPC as the RDS instance, associating the secret. Configure Security Groups: Lambda SG → Proxy SG (port 5432/3306), Proxy SG → RDS SG (port 5432/3306). Create IAM Policy with rds-db:connect for the Proxy ARN and attach to Lambda Execution Role. Test Proxy-to-database connectivity via AWS console.

3. **Phase 2 — Staging Validation (Week 3)** — Change DB_HOST environment variable in staging Lambda functions to the Proxy endpoint. Run load tests simulating concurrency peaks (100–500 simultaneous invocations). Monitor DatabaseConnectionsCurrentlySessionPinned, DatabaseConnections, ClientConnections in CloudWatch. Validate that the number of database connections does not exceed the defined target. Measure P99 latency — Proxy overhead should be < 2ms under normal conditions.

4. **Phase 3 — Gradual Production Rollout (Week 4)** — Use Lambda aliases and weighted routing to migrate 10% of traffic to the Proxy endpoint. Monitor metrics for 24h. Scale to 50%, then 100% if metrics are within baseline. Keep the direct RDS endpoint as fallback for 2 weeks before decommissioning. Configure CloudWatch alarm on DatabaseConnectionsCurrentlySessionPinned > 30% as a regression signal.

5. **Phase 4 — Credential Rotation and Hardening (Week 5)** — Enable 30-day automatic rotation in Secrets Manager. Remove environment variables with hardcoded credentials from Lambda functions. Configure RDS Proxy with require_tls=true. Review IAM Policies for least privilege: rds-db:connect only for the specific Proxy ARN, not wildcard. Document pinning troubleshooting runbook for the operations team.
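The Phase 3 alarm needs metric math, because `DatabaseConnectionsCurrentlySessionPinned` is a count rather than a percentage. The sketch below builds the parameters for CloudWatch's `put_metric_alarm`. The alarm name, period, and evaluation settings are illustrative choices, and the `ProxyName` dimension is assumed to be the one the Proxy metrics are published under.

```python
from typing import Any, Dict


def pinned_ratio_alarm(proxy_name: str, threshold_pct: float = 30.0) -> Dict[str, Any]:
    """Parameters for an alarm on pinned connections as a percentage of all Proxy connections."""
    def metric(metric_id: str, metric_name: str) -> Dict[str, Any]:
        return {
            "Id": metric_id,
            "ReturnData": False,  # input to the expression only
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/RDS",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "ProxyName", "Value": proxy_name}],
                },
                "Period": 60,
                "Stat": "Average",
            },
        }

    return {
        "AlarmName": f"{proxy_name}-pinned-ratio",
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": threshold_pct,
        "EvaluationPeriods": 5,
        "TreatMissingData": "notBreaching",
        "Metrics": [
            metric("pinned", "DatabaseConnectionsCurrentlySessionPinned"),
            metric("total", "DatabaseConnections"),
            {
                "Id": "ratio",
                "Expression": "100 * pinned / total",
                "Label": "Pinned connections (%)",
                "ReturnData": True,  # the alarm evaluates this series
            },
        ],
    }
```

With AWS credentials available, this would be applied as `boto3.client("cloudwatch").put_metric_alarm(**pinned_ratio_alarm("my-proxy"))`, where `my-proxy` is a placeholder name.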

> **Risks and When NOT to Use RDS Proxy:** **Risk 1 — Silent pinning:** The biggest operational risk is undetected pinning. If the ORM emits SET statements during session initialization (SQLAlchemy, Hibernate, Sequelize do this), the Proxy pins each connection and the multiplexing benefit disappears. The DatabaseConnectionsCurrentlySessionPinned metric needs to be on an observability dashboard from day 1, not as an afterthought.

**Risk 2 — Cost in low-concurrency workloads:** For Lambda functions with concurrency < 20–30 simultaneous invocations, the Proxy cost (~$22/month for db.t3.medium) may not be justified. In these cases, increasing `max_connections` in the DB parameter group or migrating to a larger instance may be sufficient.

**Risk 3 — Long transactions:** Workloads that process large batches within a single transaction (ETL, migrations, reports) will keep the connection pinned for the entire duration of the transaction. The Proxy does not help in these cases — and may even increase latency due to the additional hop. For these patterns, consider execution outside Lambda (ECS Fargate, Glue) or breaking transactions into smaller units.

**Risk 4 — LISTEN/NOTIFY and logical replication:** If the system uses PostgreSQL LISTEN/NOTIFY for pub/sub or logical replication for CDC (Change Data Capture), the Proxy does not support these features. Connections for these purposes must go directly to the database, and the architecture needs to accommodate two distinct endpoints.

**Risk 5 — Proxy cold start latency:** The Proxy itself has an initialization time when created. In development/staging environments that are frequently created and destroyed, the overhead of provisioning the Proxy can be relevant. In production, the Proxy is persistent and this risk does not apply.

## Assessment: AWS Well-Architected Framework

- **Security**: IAM Authentication eliminates hardcoded credentials. Secrets Manager with automatic rotation reduces the exposure window for compromised credentials. Security Groups with least-privilege network access between Lambda, Proxy, and RDS. Mandatory TLS on all connections.
- **Reliability**: Accelerated Multi-AZ failover (~20–30s vs. ~60–120s without Proxy). Proxy's persistent pool absorbs reconnection spikes. Elimination of connection storms that caused cascading degradation.

> **My Perspective: The Proxy Is Correct, but It Is Not Magic:** After dealing with financial systems where a connection storm at 9am meant rejected transactions and customers calling the support center, I learned to respect the database connection problem in serverless environments. RDS Proxy is the right solution for the right problem — and this is exactly where most teams go wrong: they adopt the Proxy without understanding the pinning mechanism, and are surprised when the number of database connections does not drop as expected.

My practical recommendation: before creating the Proxy, spend an afternoon auditing what your ORM does when initializing a session. SQLAlchemy with `schema_translate_map`, Hibernate with `SET search_path`, Sequelize with timezone configuration — all emit SET statements that cause pinning. This is not a Proxy bug; it is a documented behavior. The solution is usually to move those configurations to the role/database level in PostgreSQL (`ALTER ROLE app_user SET search_path = myschema`) instead of doing it at application runtime.

On cost: for a db.t3.medium instance, the Proxy costs ~$22/month. If you are experiencing connection storms that cause degradation or outages, that cost is trivial. If you have < 30 simultaneous invocations and have never seen max_connections being hit, you do not need the Proxy. Do not add complexity without a real problem to solve.

The IAM Auth + Secrets Manager benefit is, in my opinion, frequently underestimated. In security audits of financial systems, hardcoded database credentials in environment variables are a recurring and critical finding. The Proxy solves this elegantly and without application code changes. Even if connection pooling were not needed, I would consider the Proxy solely for the security benefit in systems handling sensitive data.

Finally: if your access pattern includes long transactions, batch processing, or LISTEN/NOTIFY, Lambda is not the right compute for that workload — and the Proxy will not save a fundamentally misguided architecture. In those cases, ECS Fargate with an external PgBouncer or Aurora Serverless v2 with Data API are more honest choices.

## Success Metrics and Targets

- **DatabaseConnections (RDS) at peak:** ≥ 60% reduction vs. pre-Proxy baseline (without pinning)
- **DatabaseConnectionsCurrentlySessionPinned:** < 20% of active connections — alarm at > 30%
- **Lambda connection errors (FATAL: remaining connection slots):** Zero occurrences in production after full rollout
- **Query P99 latency (Proxy overhead):** < 2ms additional overhead vs. direct connection under normal conditions
- **Recovery time on Multi-AZ failover:** < 35s of interruption perceived by Lambda (target: < 30s)
- **Hardcoded credentials in Lambda environment variables:** Zero — 100% migrated to IAM Auth + Secrets Manager
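These targets can double as an automated rollout gate. The sketch below checks a set of measurements against the thresholds listed above. The field names are hypothetical, and the hardcoded-credential target is left out because it is an audit item rather than a runtime metric.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RolloutMetrics:
    baseline_peak_connections: int   # DatabaseConnections peak before the Proxy
    proxy_peak_connections: int      # DatabaseConnections peak with the Proxy
    pinned_connections: int          # DatabaseConnectionsCurrentlySessionPinned at peak
    connection_slot_errors: int      # "remaining connection slots" errors observed
    added_p99_latency_ms: float      # P99 overhead vs. direct connection
    failover_interruption_s: float   # interruption seen by Lambda during failover


def failed_targets(m: RolloutMetrics) -> List[str]:
    """Return the targets from this section that the measurements miss."""
    if m.baseline_peak_connections <= 0:
        raise ValueError("baseline_peak_connections must be positive")
    failures: List[str] = []
    reduction = 1 - m.proxy_peak_connections / m.baseline_peak_connections
    if reduction < 0.60:
        failures.append("connection reduction below 60%")
    if m.proxy_peak_connections > 0 and m.pinned_connections / m.proxy_peak_connections >= 0.20:
        failures.append("pinned connections at or above 20%")
    if m.connection_slot_errors > 0:
        failures.append("connection slot errors observed")
    if m.added_p99_latency_ms >= 2.0:
        failures.append("P99 overhead at or above 2 ms")
    if m.failover_interruption_s >= 35.0:
        failures.append("failover interruption at or above 35 s")
    return failures
```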

## Verdict

RDS Proxy is the correct solution for the connection storm problem in high-concurrency Lambda workloads on RDS. The proposal is approved with the following conditions: (1) the pinning audit in Phase 0 is mandatory — not optional — before production rollout; (2) the DatabaseConnectionsCurrentlySessionPinned metric must be on an observability dashboard from day one; (3) the additional cost is justified only for concurrency > 30 simultaneous invocations with a transactional, stateless access pattern.

For workloads with long transactions, batch processing, or LISTEN/NOTIFY usage, the Proxy is not the solution — the problem lies in the compute choice, not the connection layer. In those cases, moving the workload to ECS Fargate or using Aurora Serverless v2 with Data API are the correct alternatives.

The security benefit of IAM Auth + Secrets Manager is independent of the pooling benefit and should be adopted in any system handling sensitive data, regardless of concurrency level. Hardcoded credentials in Lambda environment variables are an unacceptable security risk in production systems.

## References

- [Amazon RDS Proxy — AWS Documentation](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/rds-proxy.html)
- [Using Amazon RDS Proxy with AWS Lambda — AWS Lambda Developer Guide](https://docs.aws.amazon.com/lambda/latest/dg/services-rds-tutorial.html)

