ADR: Aurora Sharding — App-Level vs Aurora Limitless vs Citus
Listen to study
generated on playGenerated only on first play
A high-growth OLTP workload exhausted the capacity of a single Aurora PostgreSQL writer. This ADR evaluates three sharding strategies — application-layer sharding, Aurora Limitless Database, and managed Citus/PostgreSQL — weighing operational complexity, cost, cross-shard query support, and migration risk.
When a single Aurora PostgreSQL writer is no longer sufficient, the decision of how to distribute load is not trivial. Each sharding approach carries a different contract with the engineering team: full control at high cognitive cost, a managed service with model constraints, or a database extension with portability trade-offs. This ADR documents the reasoning that would lead me to choose Aurora Limitless as the primary path — and when I would change my mind.
Scenario Context
- System
- High-growth OLTP platform (composite scenario, not a specific company)
- Current engine
- Aurora PostgreSQL 15, single writer, multiple read replicas
- Problem scale
- Writer saturated: CPU > 85% sustained, growing WAL lag, P99 write latency > 400ms
- Estimated volume
- ~50k transactions/second peak, main dataset ~4TB, 15% monthly growth
- Access pattern
- 80% of writes concentrated on main entity with natural shard key (tenant_id / account_id)
- Constraints
- 6-engineer data team; 99.99% SLA; maintenance window < 30 min/month
- Relevant stack
- AWS (us-east-1), Aurora PostgreSQL, Lambda, ECS, Terraform, Datadog
- ADR status
- Accepted — Aurora Limitless as primary decision
The Real Problem: Why Sharding Is the Last Option, Not the First
Before any conversation about sharding, it's mandatory to exhaust vertical and read-scaling alternatives. In this scenario, we've already done that: the writer is on db.r7g.16xlarge, read replicas absorb 100% of read queries via PgBouncer with session pooling, and indexes have been reviewed by a senior DBA. The problem is structural — write volume doesn't fit in a single node, and Aurora doesn't offer native horizontal write scaling in the classic model.
The diagnosis matters here. The bottleneck is not storage I/O — Aurora decouples storage from compute and the distributed storage layer scales automatically. The bottleneck is the WAL writer process and the lock manager on the compute node. Every committed transaction must pass through the writer, serialize the WAL, and wait for the storage node quorum ACK. With 50k TPS peak and average transactions of 5-8 statements, we're talking about hundreds of thousands of lock operations per second. This is a compute limit, not a storage limit.
The second dimension of the problem is growth. With 15% monthly growth, even if a vertical upgrade bought 6 months, we'd be back at the same point in under a year — with an even larger, more expensive instance and fewer escape options. Sharding isn't just about the present; it's about creating an architecture that scales with the business without requiring a new platform decision every growth cycle.
The third dimension is the team. Six data engineers with a 99.99% SLA means every unplanned oncall hour is expensive. The sharding choice cannot create an operational surface the team can't sustain. This immediately eliminates any solution that requires the team to implement their own rebalancing, shard metadata management, or cross-shard query routing from scratch.
The Forces at Play
Any sharding decision involves tensions that can't be resolved — only managed. I'll be explicit about each one:
Shard key and data distribution. We have tenant_id as a natural key. This is a huge advantage: most queries are scoped by tenant, meaning cross-shard queries are the exception, not the rule. But tenants aren't uniform — some have 100x more data than others. Any sharding strategy needs to handle hot shards from large tenants, either via sub-sharding or special routing.
Cross-shard queries. Even with 80% of queries scoped by tenant, the remaining 20% exist: aggregate reports, admin queries, cross-tenant joins in specific features. The question isn't whether cross-shard queries will happen, but what the cost of executing them is and who pays that cost — the application, the database, or the engineer who needs to rewrite the query.
Rebalancing. Shards aren't static. Tenants grow, new tenants arrive, tenants leave. The ability to move data between shards without downtime is critical. Solutions that require the team to implement this manually are a serious operational risk for a 6-person team.
PostgreSQL compatibility. The system uses PostgreSQL-specific features: JSONB, pg_trgm for partial full-text search, and some legacy stored procedures. Any solution needs to preserve this compatibility or the application migration cost explodes.
Cost. Aurora Limitless has a different pricing model than classic Aurora — you pay for Distributed Processing Units (DPUs) in addition to storage. For workloads that could be served by 2-3 independent writers with app-level sharding, Limitless cost may be 30-50% higher (estimate based on public AWS benchmarks, not direct measurement in this scenario). That delta needs to be justified by reduced operational complexity.
Vendor lock-in. Aurora Limitless is AWS-specific. Citus is open-source but has a managed version on Azure (Cosmos DB for PostgreSQL). App-level sharding with standard PostgreSQL is the most portable. For a system with 99.99% SLA and critical business dependency, portability has real value — but it's not free.
Evaluated Options
Option A: Application-Layer Sharding
- Full control over routing logic and rebalancing
- Maximum portability — any PostgreSQL works as a shard
- Potentially lower infra cost (independent Aurora writers)
- No dependency on experimental features or specific managed services
- High implementation complexity: shard registry, connection routing, distributed transactions
- Cross-shard queries must be implemented and maintained in the application (manual scatter-gather)
- Shard rebalancing is the team's responsibility — serious operational risk
- Schema migrations must be coordinated across all shards simultaneously
- Reference: Figma and Notion took 12-18 months to stabilize their implementations
Viable for large teams with distributed systems expertise. Inadequate for a 6-engineer team with critical SLA.
Option B: Aurora Limitless Database
- Transparent sharding managed by AWS — no shard registry in the application
- Automatic shard rebalancing without downtime
- Standard PostgreSQL interface — compatibility with existing drivers via Transaction Router
- Horizontal write scaling without application architecture changes
- Native integration with AWS ecosystem (IAM, CloudWatch, Parameter Groups)
- Vendor lock-in: AWS-exclusive feature, no direct migration path to another provider
- Additional DPU cost — may be 30-50% more expensive than manual sharding at moderate scale (estimate)
- Documented limitations: some DDLs and PostgreSQL features not supported in distributed model
- Relatively new feature (GA in 2024) — less battle-tested production history
- Shard key must be defined at table creation — changing it later is complex
Best trade-off for this scenario: drastically reduces operational burden while maintaining PostgreSQL compatibility and write scalability.
Option C: Citus / Managed PostgreSQL (Azure or self-managed)
- Open-source with mature, well-documented extension
- Native support for distributed queries including cross-shard aggregations and joins
- Greater portability than Limitless — can run on any cloud or on-prem
- Shard rebalancing natively supported by the extension
- Self-managed Citus: high operational burden — upgrades, HA, backups, rebalancing are team's responsibility
- Managed Citus (Azure Cosmos DB for PostgreSQL): cloud provider change — high infra and operational migration cost
- Coordinator/worker model introduces single point of failure in coordinator (mitigable with HA, but adds complexity)
- Compatibility with some advanced PostgreSQL features may vary by version
Excellent technical option if the system is not heavily coupled to AWS or if portability is a strategic priority. Not the case here.
Architecture Decision
OLTP workload with saturated Aurora PostgreSQL writer, 6-engineer data team, 99.99% SLA, natural shard key available (tenant_id), and stack heavily integrated with AWS. Vertical scaling alternatives have been exhausted. The decision needs to balance write scalability, operational burden, and cost.
Adopt **Aurora Limitless Database** as the primary sharding strategy, with `tenant_id` as the shard key for high-write tables. Reference tables (lookup tables, configurations) will be defined as reference tables in Limitless to avoid cross-shard joins. Migration will be done via dual-write with consistency validation before cutover. Maintain classic Aurora as fallback for 90 days post-migration.
- ✅ Horizontal write scaling without application logic changes — Limitless Transaction Router absorbs routing
- ✅ Automatic shard rebalancing eliminates the main source of operational risk in manual sharding
- ✅ Compatibility with existing PostgreSQL drivers — no DAL (Data Access Layer) refactoring
- ⚠️ Vendor lock-in increases: exiting Limitless requires migration to classic Aurora with manual sharding or Citus
- ⚠️ DPU cost needs monitoring — risk of non-linear cost growth with TPS increase
- ⚠️ Shard key (tenant_id) must be present in all write queries on distributed tables — queries without shard key will broadcast and be expensive
Why Not Application-Level Sharding: The Lesson from Figma and Notion
Figma and Notion are well-documented cases of application-layer sharding. Both worked — but both have data engineering teams of dozens of people, years of runway to stabilize the implementation, and the capacity to absorb the costs of a complex migration. What these cases teach is not that application-level sharding is a good idea for everyone — it's that it's a good idea for those who have the resources to do it correctly.
For a 6-engineer team with a 99.99% SLA, the real cost of application-level sharding is not the initial routing code. It's what comes after: the first hot shard that appears at 2am and needs to be manually rebalanced; the schema migration that needs to run across 8 shards in a coordinated fashion with tested rollback; the reporting query someone wrote without a shard key that's now doing a full scan across all shards; the bug in the shard registry that sends 0.1% of writes to the wrong shard and is only discovered 3 days later.
These are not hypothetical problems. They are the problems that invariably appear in manual sharding implementations, and the question is whether the team has the capacity to solve them without compromising the SLA. The honest answer, for a 6-person team without prior experience in distributed sharding, is no.
This doesn't mean application-level sharding is wrong. It means it has an organizational maturity cost that needs to be acknowledged. If the team grows to 15-20 engineers with specific distributed systems experience, the equation changes — and the portability of manual sharding becomes a real asset. Today, it's not.
Target Architecture: Aurora Limitless with Tenant-Based Routing
Write/read flow in production after migration to Aurora Limitless. The Transaction Router is the single entry point — the application doesn't know about shard groups. Reference tables are replicated across all shard groups to avoid cross-shard joins.
- Application · (ECS Tasks)
- PgBouncer · Connection Pool
- Transaction Router · (Limitless Endpoint)
- Shard Group 1 · tenant_id: 0-33%
- Shard Group 2 · tenant_id: 33-66%
- Shard Group 3 · tenant_id: 66-100%
- Reference Tables · (replicated all shards)
- CloudWatch · DPU / Latency / Errors
- Datadog · APM + DB Metrics
- IAM Auth · DB Credentials via Secrets Manager
Detailed Technical Comparison
| Criterion | App-Level Sharding | Aurora Limitless | Managed Citus | |
|---|---|---|---|---|
| Operational burden | High | Low | Medium | — |
| Cross-shard queries | Manual (scatter-gather in app) | Supported via Transaction Router | Natively supported (Citus planner) | — |
| Rebalancing | Manual — team's responsibility | Automatic by AWS | Supported by Citus extension | — |
| PostgreSQL compatibility | Full (standard PostgreSQL) | High, with documented DDL limitations | High, with limitations on some complex queries | — |
| Portability | Maximum | Minimal (AWS-only) | High (open-source) | — |
| Relative cost (estimate) | Lower (independent Aurora writers) | Higher (additional DPUs) | Medium (depends on managed vs self-hosted) | — |
| Feature maturity | High (industry standard) | Medium (GA 2024) | High (Citus exists since 2012) | — |
| Time to production (estimate) | 12-18 months to stabilize | 3-6 months (migration + validation) | 6-9 months (if cloud change) | — |
Migration Plan to Aurora Limitless
- 1
Phase 1 — Audit (Weeks 1-3)
Map all queries that don't include tenant_id. Identify stored procedures with DDL incompatible with Limitless. Catalog candidate reference tables. Measure data distribution by tenant to identify hot tenants.
- 2
Phase 2 — Limitless Environment (Weeks 4-6)
Provision Aurora Limitless cluster in staging environment. Create distributed tables with tenant_id as shard key. Create reference tables for lookup data. Run load tests with shadow traffic.
- 3
Phase 3 — Dual-Write (Weeks 7-10)
Implement dual-write in DAL: all writes go to both classic Aurora AND Aurora Limitless. Reads continue on classic Aurora. Run continuous consistency validation between both systems. Monitor divergences.
- 4
Phase 4 — Gradual Cutover (Weeks 11-14)
Migrate reads to Aurora Limitless by tenant (starting with smaller tenants). Monitor P99 latency and error rate. Gradually increase percentage. Full cutover when 100% of tenants validated.
- 5
Phase 5 — Stabilization (Weeks 15-24)
Maintain classic Aurora as fallback for 90 days. Monitor DPU cost vs projection. Disable dual-write. After 90 days without incidents, deprecate classic Aurora and release resources.
When I analyze this type of decision, the first question I ask is not technical — it's organizational: who will wake up at 3am when this breaks, and what will they need to know to fix it? Application-level sharding is technically elegant when done well. But 'done well' requires a level of operational maturity that most teams underestimate. The routing code is the easy part. The hard part is rebalancing hot shards in production with an active SLA, coordinating schema migrations across multiple shards, and debugging inconsistencies that only appear in cross-shard queries during peak hours. Aurora Limitless solves exactly these problems — at the cost of vendor lock-in and additional DPUs. For a 6-engineer team with a 99.99% SLA, that's a trade-off I would accept without hesitation. DPU cost is predictable and monitorable; the cost of a poorly implemented sharding incident at 3am is not. What I would do differently from what's documented here: I would add an explicit circuit breaker in the Transaction Router — if Limitless returns errors above a threshold, the application automatically falls back to classic Aurora (which remains in dual-write for 90 days). This isn't paranoia; it's acknowledging that Limitless has been GA since 2024 and that any new database feature will have edge cases that only appear in production. On Citus: it's a genuinely good option that I would choose if the system were not heavily coupled to AWS or if there were a strategic portability requirement. The fact that the managed version is on Azure is a real problem for an AWS-first system — not impossible to solve, but it adds a layer of operational complexity that isn't justified here. The general lesson: sharding is a team decision, not just an architecture one. The best technical solution that the team can't operate is worse than a good-enough solution that the team masters.
AWS Well-Architected: Decision Impact
Security
IAM authentication for the Limitless endpoint via Secrets Manager. Encryption at rest and in transit inherited from Aurora. The Transaction Router is the single entry point — reduces attack surface vs multiple independent writers in the app-level sharding model.
Reliability
Aurora Limitless inherits Aurora's distributed storage durability (6 copies across 3 AZs). Automatic shard rebalancing reduces the risk of hot shards causing service degradation. The dual-write plan with fallback to classic Aurora ensures RTO < 5 minutes in case of regression.
Performance efficiency
The Transaction Router distributes writes across shard groups based on tenant_id hash, eliminating the single writer bottleneck. Reference tables avoid cross-shard joins for lookup data. Queries without shard key do broadcasts — they need to be monitored and eliminated.
Sustainability
Consolidation of compute in Limitless vs multiple independent writers may reduce total resource consumption if Limitless is more efficient in DPU allocation per TPS. Requires validation with real benchmarks.
Verdict
Aurora Limitless is the correct decision for this scenario — not because it's the most technically elegant solution, but because it best aligns team capacity, SLA constraints, and delivery speed. Application-level sharding is a legitimate solution for large teams with distributed systems expertise and time to stabilize the implementation. For a 6-engineer team with a 99.99% SLA, the operational cost of manual sharding is prohibitive — not in money, but in risk and cognitive capacity. Citus is technically excellent, but the managed version is on Azure, and moving an AWS-first system to another cloud provider to solve a database problem is a trade-off that doesn't make sense here. The real risks of the decision are two: DPU cost may be non-linear with TPS growth (monitor from day one), and Limitless has been GA since 2024 (maintain the fallback to classic Aurora for at least 90 days post-cutover). Neither of these risks invalidates the decision — they need to be managed. The sharding decision is, ultimately, a decision about where you want complexity to live: in application code, in the managed service, or in the database extension. For this scenario, the managed service is the right answer.