# Notion: The Block Model and Postgres Sharding

Notion grew from a monolithic Postgres database to a horizontally sharded architecture, partitioning data by tenant to sustain explosive pandemic-era growth. This teardown reconstructs the data modeling decisions, migration process, and real trade-offs of operating Postgres at scale.

- URL: https://fernando.moretes.com/studies/notion-sharding-postgres

- Markdown: https://fernando.moretes.com/studies/notion-sharding-postgres/study.md?lang=en

- Type: Teardown

- Company: Notion

- Domain: Dados

- Date: 2021-10-05

- Tags: postgres, sharding, data-modeling, scalability, migration, multi-tenant, block-model, databases

- Reading time: 6 min

---

In 2020, Notion became a mass-market product. What had been a single, well-behaved Postgres instance started showing serious stress signals — exhausted connections, unpredictable latencies, and a central 'blocks' table growing at a rate that made any maintenance operation a risk event. The answer was logical tenant-based sharding, executed without downtime on existing infrastructure. This teardown reconstructs the architecture, evaluates the decisions, and states what I would do differently.

## Fact Sheet

- **Company:** Notion Labs, Inc.
- **Domain:** Productivity / Collaboration / SaaS
- **Critical period:** 2020–2021 (pandemic growth)
- **Database stack:** PostgreSQL (RDS / managed instances), PgBouncer
- **Core data model:** Universal block table — everything is a block
- **Sharding strategy:** Logical sharding by workspace_id (tenant), then physical
- **Stated impact:** Elimination of connection bottlenecks, p99 latency reduction, ability to scale horizontally without application redesign
- **Migration:** Zero downtime, dual-write + backfill + gradual cutover

## The Problem: Everything Is a Block, and That Has Consequences

Notion's most fundamental modeling decision is also its riskiest from a database perspective: **everything is a block**. Pages, paragraphs, images, tables, bullet points, embedded databases — all are rows in the same `block` table. Each block has a UUID `id`, a `type`, a JSONB `properties` field with variable content, and pointers to `parent_id` and `space_id` (the workspace). The entire document hierarchy is an adjacency tree within that single table.

This choice has real elegance. The data model is uniform, application code deals with a single entity type, and features like moving a block between pages or types are trivial — you just update `parent_id`. There are no schema migrations for new content types; you simply add a new `type` value and populate `properties` differently.

But the elegance has a direct cost at scale. A single table accumulating **all content from all users** creates an enormous contention surface. Indexes grow unbounded. Postgres's `autovacuum` — the background process that reclaims dead row space and updates statistics — starts struggling against write volume. DDL operations (like adding an index) block or take hours. And most critically: Postgres has no native sharding. All load hits a single primary, and vertical scaling has a clear ceiling — both in cost and replication lag.

## The Architecture Before Sharding: A Monolith Under Pressure

Before the migration, Notion operated with a relatively conventional architecture for a growing SaaS: a Node.js/TypeScript application connected to a Postgres primary via PgBouncer for connection pooling, with read replicas for read-heavy queries. PgBouncer was essential — Postgres has a non-trivial per-connection cost (memory, process context), and modern applications with dozens of server instances easily saturate the database with thousands of simultaneous connections.

The connection problem was the first visible crisis signal. With accelerated growth in 2020, the number of application instances increased, and even with PgBouncer, the primary database started hitting limits. But the deeper problem was **write throughput and latency**. The `block` table received extremely high-frequency writes — every keystroke from every user can generate one or more new rows or updates. With millions of active users, this translates to tens of thousands of writes per second, all contending for locks on the same table, on the same indexes.

Autovacuum deserves special attention here. In Postgres, when you update a row, the old version isn't immediately deleted — it stays as a 'dead tuple' until vacuum runs. On high-write tables, accumulation of dead tuples degrades read performance (indexes bloat, sequential scans slow down) and can, in extreme cases, cause **transaction ID wraparound** — a catastrophic event that forces the database to stop accepting writes until vacuum completes. Notion was clearly approaching conditions where autovacuum couldn't keep pace.

## Reconstructed Architecture: Before and After Sharding

Diagram reconstructed from Notion's official post. Shows the block model data flow, shard routing layer, and tenant-based distribution.

### 👤 Clients

- Notion Client (Web / Desktop / Mobile) (user)

### ⚙️ Application Layer

- API Server (Node.js / TypeScript) (compute)
- Shard Router (workspace_id → shard) (compute)

### 🔌 Connection Layer

- PgBouncer Shard 0 (network)
- PgBouncer Shard 1 (network)
- PgBouncer Shard N (network)

### 🗄️ Physical Shards (Postgres)

- Postgres Primary Shard 0 (workspaces 0..k) (data)
- Postgres Replica Shard 0 (data)
- Postgres Primary Shard 1 (workspaces k+1..m) (data)
- Postgres Replica Shard 1 (data)
- Postgres Primary Shard N (data)
- Postgres Replica Shard N (data)

### 🗺️ Shard Mapping

- Shard Map (workspace_id → shard_id) Postgres / in-memory cache (data)

### Flows

- client -> api: request
- api -> router: workspace_id
- router -> shardmap: lookup
- router -> pgb1: shard 0
- router -> pgb2: shard 1
- router -> pgbn: shard N
- pgb1 -> pg0
- pgb2 -> pg1
- pgbn -> pgn
- pg0 -> pg0r: replication
- pg1 -> pg1r: replication
- pgn -> pgnr: replication

## How It Works: Logical Sharding First, Physical Second

Notion's strategy was smart in its sequencing: **introduce logical sharding first without moving data, then migrate physically**. This is the correct pattern for production database migrations — you separate the modeling problem from the migration problem.

**Phase 1 — Logical Sharding:** The application begins treating each `workspace_id` as belonging to a logical shard. A shard map (a simple table, cached in memory) maps `workspace_id → shard_id`. Every query passes through the shard router, which resolves the correct shard before opening a connection. At this point, all logical shards still point to the same physical database — behavior changes, but data doesn't move. This allows testing routing in production without data loss risk.

**Phase 2 — Physical Migration with Dual-Write:** To move workspaces to new physical shards, Notion used the classic dual-write with backfill pattern. The process: (a) start writing to both databases (source and destination) for new events; (b) backfill historical data to the destination; (c) verify consistency; (d) redirect reads to the destination; (e) stop writing to the source. Each step is reversible. The risk of each cutover is limited to the set of workspaces being migrated, not the entire system.

**The key to choosing `workspace_id` as the shard key:** Every Notion query includes workspace context — a user is always operating within a specific workspace. This means the shard key is naturally present in all operations, and **there are no cross-shard queries on the critical path**. This is fundamental: sharding only works well when you can guarantee that the vast majority of queries touch a single shard. Notion's block model, with its hierarchy contained within a workspace, makes this natural.

What sharding does **not** solve: analytical queries that need to aggregate data across multiple workspaces (e.g., product metrics, internal dashboards) now require fan-out to all shards or a separate data pipeline. This is the accepted trade-off — you optimize for the transactional critical path and accept complexity in analytics.

## Decision Matrix: Alternatives to Tenant-Based Sharding

### Sharding by workspace_id (Notion's choice)

**Pros**
- Shard key always present in queries — zero cross-shard on critical path
- Natural tenant isolation — large workspaces don't impact small ones
- Incremental migration possible (workspace by workspace)
- Compatible with the application's mental model

**Cons**
- Very large workspaces (hot tenants) can still overload a single shard
- Cross-workspace analytical queries require fan-out or separate pipeline
- Shard rebalancing is operationally complex

**Verdict:** Correct decision given the data model and access patterns

### Vertical scaling (instance upgrade)

**Pros**
- Zero code or architecture change
- Operationally simple in the short term

**Cons**
- Physical hardware ceiling — doesn't scale indefinitely
- Cost grows non-linearly with instance size
- Doesn't resolve lock contention or autovacuum lag
- Defers the problem, doesn't solve it

**Verdict:** Palliative — adequate as a temporary measure, not as a strategy

### Migrate to distributed database (Vitess, CockroachDB, Spanner)

**Pros**
- Sharding managed by the platform, not the application
- Automatic data rebalancing
- Native horizontal scaling

**Cons**
- Production database migration is extremely high risk
- SQL semantics may differ — requires extensive validation
- Significant operational learning curve
- Potentially high licensing or operational cost

**Verdict:** Valid for greenfield or with much more time available; excessive risk during a crisis

### Read replicas + aggressive caching

**Pros**
- Reduces read load on primary
- Relatively simple to implement

**Cons**
- Doesn't solve the write problem — the main bottleneck
- Cache invalidation is complex in a real-time collaborative model
- Replication lag can cause inconsistent reads

**Verdict:** Complementary, not a substitute — doesn't address the fundamental problem

## Well-Architected Review

- **security**: **Adequate, with caveats.** Tenant isolation at the shard level is a real security improvement — a data leak in one shard doesn't automatically expose all workspaces. However, the shard map itself is a critical security component: if compromised or corrupted, queries can be routed to the wrong shard, exposing another tenant's data. The shard map requires strict access controls, auditing, and integrity validation. There is no public evidence of how Notion handles this.
- **reliability**: **Strong.** Incremental migration by workspace minimized the blast radius of each operation. Dual-write with consistency verification before cutover is the correct pattern. Each shard has a replica, maintaining reasonable RTO/RPO. Residual risk is the hot tenant — a very large enterprise workspace can still degrade an entire shard. Recommended mitigation: per-workspace circuit breakers and shard imbalance monitoring.
- **performance**: **Strong for the transactional path.** Distributing writes across multiple primaries resolves the central bottleneck. Each shard carries a fraction of the autovacuum load, making maintenance manageable. Per-shard PgBouncer maintains efficient pooling. The weak point is the latency overhead of the shard map lookup — mitigated by in-memory cache, but requiring careful invalidation when workspaces are migrated between shards.
- **cost**: **Conscious trade-off.** Multiple Postgres instances (primary + replica per shard) multiply infrastructure cost. However, smaller, more specialized instances are generally more cost-efficient per unit of performance than a single massive instance. The operational cost of managing N shards is real — more surface area for monitoring, backups, version upgrades. This operational cost is frequently underestimated.
- **sustainability**: **Neutral.** Smaller, more efficient instances can be more sustainable than an oversized single instance. However, multiplication of idle instances (shards with low load) can waste resources. Postgres instance auto-scaling is still limited — you can't easily spin primaries up and down.

> **What I Would Do Differently:** The decision to shard by `workspace_id` is correct — I wouldn't change that. What I would question and reinforce are three specific points:

**1. Shard map as a first-class component.** Notion's post treats the shard map almost as an implementation detail, but it's the most critical component of the architecture. Silent corruption in the shard map can route one tenant's writes to another tenant's database — a security and data consistency incident simultaneously. I would treat the shard map with the same rigor as an authentication service: explicit schema versioning, auditing of all mutations, integrity checksums, and chaos tests that simulate map corruption.

**2. Explicit strategy for hot tenants.** Tenant-based sharding solves the general scale problem, but doesn't solve the giant tenant problem — a Fortune 500 company with 50,000 active users in a single workspace. That workspace will inevitably dominate the shard it's on. The solution is to have an explicit strategy: identify tenants above a write threshold, and for those, implement intra-tenant sharding (for example, by `block_id` range or sub-space). This is complex, but necessary for a product serving enterprise.

**3. Analytical pipeline from the start.** Transactional sharding and data analytics are fundamentally different problems. I would establish from the beginning a CDC (Change Data Capture) pipeline — for example, with Debezium or pglogical — that captures all events from all shards and materializes them in a data warehouse (Snowflake, BigQuery, Redshift).

## Verdict

Notion did the right thing in the right order: identified the correct shard key (workspace_id), introduced logical sharding before moving data, and executed the physical migration incrementally and reversibly. This is mature database engineering — it's not glamorous, but it's what separates systems that survive growth from those that collapse under it.

The block model is a data modeling bet that worked for the product but created structural pressure on the database. The elegance of 'everything is a block' has a price that only appears at scale — and Notion paid that price with operational complexity, not product redesign. That's a reasonable trade.

The central lesson: **the shard key choice is practically irreversible**. Once you have millions of workspaces distributed across shards, changing the partitioning strategy is a full-scale data migration. Notion got this choice right because the product's data model (everything contained within a workspace) and access pattern (users operate within one workspace at a time) aligned perfectly with the shard key. When designing a multi-tenant system, the question 'what is my natural shard key?'

## References

- [Notion — Sharding Postgres at Notion (Official Blog Post)](https://www.notion.so/blog/sharding-postgres-at-notion)
- [PostgreSQL Documentation — Autovacuum](https://www.postgresql.org/docs/current/routine-vacuuming.html)
- [PgBouncer — Connection Pooler for PostgreSQL](https://www.pgbouncer.org/)
- [AWS Blog — Sharding with Amazon RDS for PostgreSQL](https://aws.amazon.com/blogs/database/sharding-with-amazon-relational-database-service/)
- [Designing Data-Intensive Applications — Martin Kleppmann (Partitioning chapter)](https://dataintensive.net/)
- [Debezium — CDC for PostgreSQL](https://debezium.io/documentation/reference/connectors/postgresql.html)

## Case sources

- [Notion — Sharding Postgres at Notion](https://www.notion.so/blog/sharding-postgres-at-notion)
