Studies
Decision (ADR)OLTP em alto crescimento (cenário)DadosAccepted

ADR: Aurora Sharding — App-Level vs Aurora Limitless vs Citus

Feb 18, 2026 9 min AI-assisted
Share:

Listen to study

generated on play

Generated only on first play

On demand
0:000:00
Speed
The MP3 is saved to S3 after the first play.

A high-growth OLTP workload exhausted the capacity of a single Aurora PostgreSQL writer. This ADR evaluates three sharding strategies — application-layer sharding, Aurora Limitless Database, and managed Citus/PostgreSQL — weighing operational complexity, cost, cross-shard query support, and migration risk.

When a single Aurora PostgreSQL writer is no longer sufficient, the decision of how to distribute load is not trivial. Each sharding approach carries a different contract with the engineering team: full control at high cognitive cost, a managed service with model constraints, or a database extension with portability trade-offs. This ADR documents the reasoning that would lead me to choose Aurora Limitless as the primary path — and when I would change my mind.

Scenario Context

System
High-growth OLTP platform (composite scenario, not a specific company)
Current engine
Aurora PostgreSQL 15, single writer, multiple read replicas
Problem scale
Writer saturated: CPU > 85% sustained, growing WAL lag, P99 write latency > 400ms
Estimated volume
~50k transactions/second peak, main dataset ~4TB, 15% monthly growth
Access pattern
80% of writes concentrated on main entity with natural shard key (tenant_id / account_id)
Constraints
6-engineer data team; 99.99% SLA; maintenance window < 30 min/month
Relevant stack
AWS (us-east-1), Aurora PostgreSQL, Lambda, ECS, Terraform, Datadog
ADR status
Accepted — Aurora Limitless as primary decision

The Real Problem: Why Sharding Is the Last Option, Not the First

Before any conversation about sharding, it's mandatory to exhaust vertical and read-scaling alternatives. In this scenario, we've already done that: the writer is on db.r7g.16xlarge, read replicas absorb 100% of read queries via PgBouncer with session pooling, and indexes have been reviewed by a senior DBA. The problem is structural — write volume doesn't fit in a single node, and Aurora doesn't offer native horizontal write scaling in the classic model.

The diagnosis matters here. The bottleneck is not storage I/O — Aurora decouples storage from compute and the distributed storage layer scales automatically. The bottleneck is the WAL writer process and the lock manager on the compute node. Every committed transaction must pass through the writer, serialize the WAL, and wait for the storage node quorum ACK. With 50k TPS peak and average transactions of 5-8 statements, we're talking about hundreds of thousands of lock operations per second. This is a compute limit, not a storage limit.

The second dimension of the problem is growth. With 15% monthly growth, even if a vertical upgrade bought 6 months, we'd be back at the same point in under a year — with an even larger, more expensive instance and fewer escape options. Sharding isn't just about the present; it's about creating an architecture that scales with the business without requiring a new platform decision every growth cycle.

The third dimension is the team. Six data engineers with a 99.99% SLA means every unplanned oncall hour is expensive. The sharding choice cannot create an operational surface the team can't sustain. This immediately eliminates any solution that requires the team to implement their own rebalancing, shard metadata management, or cross-shard query routing from scratch.

The Forces at Play

Any sharding decision involves tensions that can't be resolved — only managed. I'll be explicit about each one:

Shard key and data distribution. We have tenant_id as a natural key. This is a huge advantage: most queries are scoped by tenant, meaning cross-shard queries are the exception, not the rule. But tenants aren't uniform — some have 100x more data than others. Any sharding strategy needs to handle hot shards from large tenants, either via sub-sharding or special routing.

Cross-shard queries. Even with 80% of queries scoped by tenant, the remaining 20% exist: aggregate reports, admin queries, cross-tenant joins in specific features. The question isn't whether cross-shard queries will happen, but what the cost of executing them is and who pays that cost — the application, the database, or the engineer who needs to rewrite the query.

Rebalancing. Shards aren't static. Tenants grow, new tenants arrive, tenants leave. The ability to move data between shards without downtime is critical. Solutions that require the team to implement this manually are a serious operational risk for a 6-person team.

PostgreSQL compatibility. The system uses PostgreSQL-specific features: JSONB, pg_trgm for partial full-text search, and some legacy stored procedures. Any solution needs to preserve this compatibility or the application migration cost explodes.

Cost. Aurora Limitless has a different pricing model than classic Aurora — you pay for Distributed Processing Units (DPUs) in addition to storage. For workloads that could be served by 2-3 independent writers with app-level sharding, Limitless cost may be 30-50% higher (estimate based on public AWS benchmarks, not direct measurement in this scenario). That delta needs to be justified by reduced operational complexity.

Vendor lock-in. Aurora Limitless is AWS-specific. Citus is open-source but has a managed version on Azure (Cosmos DB for PostgreSQL). App-level sharding with standard PostgreSQL is the most portable. For a system with 99.99% SLA and critical business dependency, portability has real value — but it's not free.

Evaluated Options

Option A: Application-Layer Sharding

Pros
  • Full control over routing logic and rebalancing
  • Maximum portability — any PostgreSQL works as a shard
  • Potentially lower infra cost (independent Aurora writers)
  • No dependency on experimental features or specific managed services
Cons
  • High implementation complexity: shard registry, connection routing, distributed transactions
  • Cross-shard queries must be implemented and maintained in the application (manual scatter-gather)
  • Shard rebalancing is the team's responsibility — serious operational risk
  • Schema migrations must be coordinated across all shards simultaneously
  • Reference: Figma and Notion took 12-18 months to stabilize their implementations

Viable for large teams with distributed systems expertise. Inadequate for a 6-engineer team with critical SLA.

Option B: Aurora Limitless Database

Pros
  • Transparent sharding managed by AWS — no shard registry in the application
  • Automatic shard rebalancing without downtime
  • Standard PostgreSQL interface — compatibility with existing drivers via Transaction Router
  • Horizontal write scaling without application architecture changes
  • Native integration with AWS ecosystem (IAM, CloudWatch, Parameter Groups)
Cons
  • Vendor lock-in: AWS-exclusive feature, no direct migration path to another provider
  • Additional DPU cost — may be 30-50% more expensive than manual sharding at moderate scale (estimate)
  • Documented limitations: some DDLs and PostgreSQL features not supported in distributed model
  • Relatively new feature (GA in 2024) — less battle-tested production history
  • Shard key must be defined at table creation — changing it later is complex

Best trade-off for this scenario: drastically reduces operational burden while maintaining PostgreSQL compatibility and write scalability.

Option C: Citus / Managed PostgreSQL (Azure or self-managed)

Pros
  • Open-source with mature, well-documented extension
  • Native support for distributed queries including cross-shard aggregations and joins
  • Greater portability than Limitless — can run on any cloud or on-prem
  • Shard rebalancing natively supported by the extension
Cons
  • Self-managed Citus: high operational burden — upgrades, HA, backups, rebalancing are team's responsibility
  • Managed Citus (Azure Cosmos DB for PostgreSQL): cloud provider change — high infra and operational migration cost
  • Coordinator/worker model introduces single point of failure in coordinator (mitigable with HA, but adds complexity)
  • Compatibility with some advanced PostgreSQL features may vary by version

Excellent technical option if the system is not heavily coupled to AWS or if portability is a strategic priority. Not the case here.

Architecture Decision

Accepted
Context

OLTP workload with saturated Aurora PostgreSQL writer, 6-engineer data team, 99.99% SLA, natural shard key available (tenant_id), and stack heavily integrated with AWS. Vertical scaling alternatives have been exhausted. The decision needs to balance write scalability, operational burden, and cost.

Decision

Adopt **Aurora Limitless Database** as the primary sharding strategy, with `tenant_id` as the shard key for high-write tables. Reference tables (lookup tables, configurations) will be defined as reference tables in Limitless to avoid cross-shard joins. Migration will be done via dual-write with consistency validation before cutover. Maintain classic Aurora as fallback for 90 days post-migration.

Consequences
  • ✅ Horizontal write scaling without application logic changes — Limitless Transaction Router absorbs routing
  • ✅ Automatic shard rebalancing eliminates the main source of operational risk in manual sharding
  • ✅ Compatibility with existing PostgreSQL drivers — no DAL (Data Access Layer) refactoring
  • ⚠️ Vendor lock-in increases: exiting Limitless requires migration to classic Aurora with manual sharding or Citus
  • ⚠️ DPU cost needs monitoring — risk of non-linear cost growth with TPS increase
  • ⚠️ Shard key (tenant_id) must be present in all write queries on distributed tables — queries without shard key will broadcast and be expensive

Why Not Application-Level Sharding: The Lesson from Figma and Notion

Figma and Notion are well-documented cases of application-layer sharding. Both worked — but both have data engineering teams of dozens of people, years of runway to stabilize the implementation, and the capacity to absorb the costs of a complex migration. What these cases teach is not that application-level sharding is a good idea for everyone — it's that it's a good idea for those who have the resources to do it correctly.

For a 6-engineer team with a 99.99% SLA, the real cost of application-level sharding is not the initial routing code. It's what comes after: the first hot shard that appears at 2am and needs to be manually rebalanced; the schema migration that needs to run across 8 shards in a coordinated fashion with tested rollback; the reporting query someone wrote without a shard key that's now doing a full scan across all shards; the bug in the shard registry that sends 0.1% of writes to the wrong shard and is only discovered 3 days later.

These are not hypothetical problems. They are the problems that invariably appear in manual sharding implementations, and the question is whether the team has the capacity to solve them without compromising the SLA. The honest answer, for a 6-person team without prior experience in distributed sharding, is no.

This doesn't mean application-level sharding is wrong. It means it has an organizational maturity cost that needs to be acknowledged. If the team grows to 15-20 engineers with specific distributed systems experience, the equation changes — and the portability of manual sharding becomes a real asset. Today, it's not.

Target Architecture: Aurora Limitless with Tenant-Based Routing

Write/read flow in production after migration to Aurora Limitless. The Transaction Router is the single entry point — the application doesn't know about shard groups. Reference tables are replicated across all shard groups to avoid cross-shard joins.

🖥️ Application Layer
  • Application · (ECS Tasks)
  • PgBouncer · Connection Pool
🔀 Aurora Limitless — Router Layer
  • Transaction Router · (Limitless Endpoint)
🗄️ Aurora Limitless — Shard Groups
  • Shard Group 1 · tenant_id: 0-33%
  • Shard Group 2 · tenant_id: 33-66%
  • Shard Group 3 · tenant_id: 66-100%
  • Reference Tables · (replicated all shards)
📊 Observability
  • CloudWatch · DPU / Latency / Errors
  • Datadog · APM + DB Metrics
🔒 Security
  • IAM Auth · DB Credentials via Secrets Manager

Detailed Technical Comparison

CriterionApp-Level ShardingAurora LimitlessManaged Citus
Operational burdenHighLowMedium
Cross-shard queriesManual (scatter-gather in app)Supported via Transaction RouterNatively supported (Citus planner)
RebalancingManual — team's responsibilityAutomatic by AWSSupported by Citus extension
PostgreSQL compatibilityFull (standard PostgreSQL)High, with documented DDL limitationsHigh, with limitations on some complex queries
PortabilityMaximumMinimal (AWS-only)High (open-source)
Relative cost (estimate)Lower (independent Aurora writers)Higher (additional DPUs)Medium (depends on managed vs self-hosted)
Feature maturityHigh (industry standard)Medium (GA 2024)High (Citus exists since 2012)
Time to production (estimate)12-18 months to stabilize3-6 months (migration + validation)6-9 months (if cloud change)

Migration Plan to Aurora Limitless

  1. 1

    Phase 1 — Audit (Weeks 1-3)

    Map all queries that don't include tenant_id. Identify stored procedures with DDL incompatible with Limitless. Catalog candidate reference tables. Measure data distribution by tenant to identify hot tenants.

  2. 2

    Phase 2 — Limitless Environment (Weeks 4-6)

    Provision Aurora Limitless cluster in staging environment. Create distributed tables with tenant_id as shard key. Create reference tables for lookup data. Run load tests with shadow traffic.

  3. 3

    Phase 3 — Dual-Write (Weeks 7-10)

    Implement dual-write in DAL: all writes go to both classic Aurora AND Aurora Limitless. Reads continue on classic Aurora. Run continuous consistency validation between both systems. Monitor divergences.

  4. 4

    Phase 4 — Gradual Cutover (Weeks 11-14)

    Migrate reads to Aurora Limitless by tenant (starting with smaller tenants). Monitor P99 latency and error rate. Gradually increase percentage. Full cutover when 100% of tenants validated.

  5. 5

    Phase 5 — Stabilization (Weeks 15-24)

    Maintain classic Aurora as fallback for 90 days. Monitor DPU cost vs projection. Disable dual-write. After 90 days without incidents, deprecate classic Aurora and release resources.

FA
My Perspective: The Invisible Cost of Complexity
Senior Solutions Architect

When I analyze this type of decision, the first question I ask is not technical — it's organizational: who will wake up at 3am when this breaks, and what will they need to know to fix it? Application-level sharding is technically elegant when done well. But 'done well' requires a level of operational maturity that most teams underestimate. The routing code is the easy part. The hard part is rebalancing hot shards in production with an active SLA, coordinating schema migrations across multiple shards, and debugging inconsistencies that only appear in cross-shard queries during peak hours. Aurora Limitless solves exactly these problems — at the cost of vendor lock-in and additional DPUs. For a 6-engineer team with a 99.99% SLA, that's a trade-off I would accept without hesitation. DPU cost is predictable and monitorable; the cost of a poorly implemented sharding incident at 3am is not. What I would do differently from what's documented here: I would add an explicit circuit breaker in the Transaction Router — if Limitless returns errors above a threshold, the application automatically falls back to classic Aurora (which remains in dual-write for 90 days). This isn't paranoia; it's acknowledging that Limitless has been GA since 2024 and that any new database feature will have edge cases that only appear in production. On Citus: it's a genuinely good option that I would choose if the system were not heavily coupled to AWS or if there were a strategic portability requirement. The fact that the managed version is on Azure is a real problem for an AWS-first system — not impossible to solve, but it adds a layer of operational complexity that isn't justified here. The general lesson: sharding is a team decision, not just an architecture one. The best technical solution that the team can't operate is worse than a good-enough solution that the team masters.

AWS Well-Architected: Decision Impact

Security

IAM authentication for the Limitless endpoint via Secrets Manager. Encryption at rest and in transit inherited from Aurora. The Transaction Router is the single entry point — reduces attack surface vs multiple independent writers in the app-level sharding model.

Reliability

Aurora Limitless inherits Aurora's distributed storage durability (6 copies across 3 AZs). Automatic shard rebalancing reduces the risk of hot shards causing service degradation. The dual-write plan with fallback to classic Aurora ensures RTO < 5 minutes in case of regression.

Performance efficiency

The Transaction Router distributes writes across shard groups based on tenant_id hash, eliminating the single writer bottleneck. Reference tables avoid cross-shard joins for lookup data. Queries without shard key do broadcasts — they need to be monitored and eliminated.

Sustainability

Consolidation of compute in Limitless vs multiple independent writers may reduce total resource consumption if Limitless is more efficient in DPU allocation per TPS. Requires validation with real benchmarks.

Verdict

Aurora Limitless is the correct decision for this scenario — not because it's the most technically elegant solution, but because it best aligns team capacity, SLA constraints, and delivery speed. Application-level sharding is a legitimate solution for large teams with distributed systems expertise and time to stabilize the implementation. For a 6-engineer team with a 99.99% SLA, the operational cost of manual sharding is prohibitive — not in money, but in risk and cognitive capacity. Citus is technically excellent, but the managed version is on Azure, and moving an AWS-first system to another cloud provider to solve a database problem is a trade-off that doesn't make sense here. The real risks of the decision are two: DPU cost may be non-linear with TPS growth (monitor from day one), and Limitless has been GA since 2024 (maintain the fallback to classic Aurora for at least 90 days post-cutover). Neither of these risks invalidates the decision — they need to be managed. The sharding decision is, ultimately, a decision about where you want complexity to live: in application code, in the managed service, or in the database extension. For this scenario, the managed service is the right answer.

#aurora#sharding#postgresql#oltp#aurora-limitless#citus#data-platform#scalability
Share:
Written with AI assistance from the public case and my architect's reading.