Figma: Horizontal Postgres Sharding Without Stopping Growth
Listen to study
generated on playGenerated only on first play
In 2022, Figma hit the physical limits of its monolithic Postgres database and executed a horizontal sharding migration using key-based partitioning, dynamic routing, and incremental data movement — all without downtime and without halting product growth. This teardown reconstructs the architecture, analyzes the technical decisions, and points out what I would do differently.
Database sharding is one of the riskiest operations an engineering team can execute in production. Figma did it with Postgres, at global SaaS product scale, without downtime and without halting growth — and then wrote about it. This teardown dismantles the architecture, evaluates the decisions, and says what I would change.
Fact Sheet
- Company
- Figma
- Domain
- Collaborative design SaaS platform
- Migration period
- 2021–2022 (published 2022)
- Original database
- Monolithic PostgreSQL (single write instance)
- Target database
- PostgreSQL with horizontal sharding by tenant/object key
- Main stack
- PostgreSQL, PgBouncer, Ruby on Rails, AWS infrastructure
- Routing strategy
- Custom application-layer routing (shard map)
- Downtime
- Zero downtime for end users
- Problem impact
- CPU, WAL lag, and connection capacity near the physical limits of the primary instance
The Problem: A Monolithic Postgres Swallowing a Growing Company
Figma grew too fast for its own data infrastructure. For years, the database architecture followed the classic successful-startup pattern: a primary Postgres with read replicas, scaling vertically whenever warning signs appeared. That approach worked — until the moment it stopped working.
The problem was not a single catastrophic event. It was a convergence of pressures: the primary instance accumulating chronically high CPU, WAL (Write-Ahead Log) showing increasing lag on replicas, and the connection pool operating near PgBouncer's physical limits. In high-frequency transactional systems like Figma — where multiple users edit the same document in real time — any degradation in the primary database propagates directly to the user experience.
Vertical scaling had reached its practical ceiling. Moving to a larger instance would solve the problem for months, not years. The team needed a structural solution, and the only way out was horizontal sharding — distributing data across multiple independent Postgres nodes, each responsible for a subset of the system's entities.
What makes this problem especially hard is that Figma did not have the luxury of pausing the product to perform the migration. Business growth continued, new users arrived, new files were created. The migration had to happen underneath a fully operational system, without users noticing.
The Reconstructed Architecture: How the System Works
Figma's solution can be decomposed into three interdependent layers: logical partitioning, dynamic routing, and incremental migration with dual-write.
Logical partitioning by entity key
The chosen sharding scheme was key-based — specifically, on the root entity identifier (typically file_id or tenant equivalent). This means all data related to a given file or user is colocated on the same physical shard. Colocation is critical: cross-shard queries are extremely costly and, if poorly planned, eliminate any scaling gains. Figma opted for aggressive colocation to avoid this antipattern.
The number of logical shards was defined as an intentional multiple of the physical shards. This is a sophisticated engineering decision: by separating logical shards from physical shards, the system can rebalance data by moving logical shards between physical nodes without needing to rehash all keys. It is the same principle as consistent hashing, but implemented more explicitly with an application-controlled shard map.
Dynamic routing at the application layer
Instead of adopting an external database proxy (such as Vitess or Citus), Figma implemented routing inside the Rails application itself. A central component — the shard map — maintains the mapping of logical shard_id → physical Postgres instance. Before any query, the application consults this map to determine which database connection to route the operation to.
This choice has deep implications. On one hand, it eliminates a network hop and an additional infrastructure layer to operate. On the other hand, it couples routing logic to application code, meaning every change to the shard map needs to be coordinated with application deploys — or managed via feature flags and dynamic configuration.
Incremental migration with dual-write and verification
The most delicate part of the entire operation was migrating existing data without downtime. The process followed a classic online migration pattern: first, enable dual-write for the entities being migrated, where each write goes to both the source and destination shard; second, run the historical data backfill in the background with rate control to avoid overloading the primary; third, verify consistency between source and destination before cutting read traffic; fourth, promote the destination shard as authoritative and remove the duplicate write.
Each phase of this migration was executed incrementally, table by table, entity by entity. It was not a big-bang migration — it was a series of surgical migrations, each with a defined rollback path.
Figma Horizontal Sharding Architecture
Read/write flow with shard map routing, dual-write during migration, and per-physical-shard replicas.
- Figma Client · (Browser / Desktop)
- Rails App · (API Servers)
- Shard Map · (logical→physical)
- PgBouncer · (connection pool)
- Postgres Primary · Shard 0
- Postgres Replica · Shard 0
- Postgres Primary · Shard 1
- Postgres Replica · Shard 1
- Postgres Primary · Shard N
- Postgres Replica · Shard N
- Backfill Worker · (rate-limited)
- Consistency Verifier · (diff checker)
The Hidden Complexity: What Doesn't Show Up in the Diagram
Any sharding diagram looks reasonable on paper. The brutality is in the operational details the diagram doesn't capture.
The distributed transactions problem
Postgres has no native support for distributed transactions across independent instances. This means any operation that needs atomicity across two different shards must be redesigned. Figma solved this in two ways: first, through aggressive colocation of related entities on the same shard (avoiding the problem at the source); second, by accepting eventual consistency in cases where cross-shard atomicity was infeasible and redesigning the data model to eliminate cross-shard dependencies wherever possible.
This is a design decision with direct product impact. Some query patterns that were trivial in the monolith — a JOIN between tables of different entities — become prohibited or very costly in the sharded world. The engineering team had to audit all existing queries and classify them by access pattern before defining the partitioning strategy.
The human cost of dual-write
The dual-write phase is conceptually simple but operationally expensive. During the migration period, each relevant write needs to go to two destinations. This increases write latency, doubles pressure on the connection pool, and introduces a new failure vector: what happens if the write to the destination shard fails? Figma had to define explicit failure-handling policies for each phase of the migration — and those policies had to be conservative enough not to corrupt data, but aggressive enough not to stall the product.
Managing the shard map as critical infrastructure
The shard map is not a static configuration file. It is an infrastructure component that needs to be read at high frequency (every query passes through it), have very low latency, and be updated atomically during rebalancing. Figma had to treat the shard map as a first-class service — with application-level caching, controlled invalidation, and dedicated monitoring. A shard map failure is a global failure: if the application cannot resolve which shard to route a query to, it cannot serve any request.
Schema migrations in a sharded world
In a monolith, a schema migration is a single, controlled event. With N physical shards, each migration needs to be executed N times, in a coordinated fashion, without causing inconsistency between shards during the transition period. Figma adopted the practice of backward-compatible schema changes — adding columns as nullable before making them required, keeping old columns during transition periods — to ensure that different shards could be in slightly different schema states during rollout.
Decision Matrix: Sharding Options Considered
Application-layer routing (Figma's choice)
- No additional network hop
- Full control over routing logic and failover
- No external product dependency on the critical path
- Native integration with feature flags and incremental rollout
- Sharding logic coupled to application code
- Each language/service needs to implement routing
- Shard map becomes internally managed critical infrastructure
Correct for the context: homogeneous Rails stack, team with full code ownership
Vitess (MySQL-compatible proxy)
- Transparent sharding for the application
- Centralized connection management
- Widely tested in production (YouTube, PlanetScale)
- Incompatible with Postgres — would also require database migration
- Adds complex infrastructure layer to operate
- Additional proxy latency on the critical path
Discarded: incompatible with Postgres and introduces double risk
Citus (Postgres extension for sharding)
- Native sharding within the Postgres ecosystem
- Distributed SQL with cross-shard join support
- Compatible with existing ecosystem tooling
- Adds operational complexity of a critical extension
- Schema and data migration to Citus model would be big-bang
- Vendor lock-in to a specific extension
Viable, but migration cost and lock-in were unfavorable in context
Continued vertical scaling
- Zero application code changes
- Operationally simple
- Physical hardware ceiling already near
- Exponential cost for linear capacity gains
- Does not solve the connection problem (PgBouncer has limits)
Palliative solution — buys time but does not solve the structural problem
AWS Well-Architected Framework Read
Security
Adequate for the context, but not highlighted. The post does not detail the security posture, but the architecture implies multiple database endpoints, each requiring independent access control. In regulated environments (Figma is not financial, but serves enterprise companies), the proliferation of database endpoints increases the attack surface and audit complexity. Per-shard rotated credentials, per-instance network policies, and per-shard access auditing are requirements that scale linearly with the number of shards.
Reliability
Strong. The incremental migration with dual-write and consistency verification is a high-reliability pattern. Each phase has a defined rollback. The separation between logical and physical shards enables rebalancing without full rehashing. The main risk is the shard map as a logical single point of failure — if not treated with proper redundancy and caching, a failure there is a global failure. Figma demonstrates awareness of this by treating the shard map as critical infrastructure.
Performance efficiency
Strong. The central objective of the project was performance, and the architecture addresses it directly: distributing write load across multiple primaries, reducing lock contention, and per-shard connection pools instead of a single global pool. Colocation of related entities on the same shard minimizes query costs. The performance degradation risk lies in unanticipated cross-shard queries — which, without rigorous auditing, can surface in production.
Sustainability
Neutral. More instances means more energy consumption. There is no public information about specific energy efficiency strategies for this architecture. Consolidating logical shards into fewer physical shards when growth stabilizes would be the most direct sustainability lever.
Figma's execution is technically solid and the post is one of the most honest pieces on sharding I've read. But there are three points where I would make different decisions — or at least question them earlier. 1. Shard map as an explicit service, not an embedded library. Implementing routing inside the Rails application is pragmatic for a homogeneous stack, but creates a long-term problem: when Figma adds services in other languages (Go, Python, whatever), each will need to reimplement the shard map client. I would have extracted the shard map as a lightweight service (an HTTP/gRPC endpoint with aggressive client-side caching), ensuring that routing logic was centralized and versioned independently from application code. The latency cost of a local cache with a short TTL is negligible; the cost of maintaining N implementations of the same routing logic in N languages is high. 2. Cross-shard query auditing as a mandatory gate before sharding. The biggest risk in a sharding migration is discovering, in production, that a critical query joins entities that ended up on different shards. I would have invested more time — before any migration code — in instrumenting the monolithic database to identify all join patterns, classify entities by access affinity, and only then define the partitioning strategy. This is tedious, slow work, but it is what separates a successful sharding migration from a production incident six months later. 3. Explicit eventual consistency in the API contract. During the dual-write phase, there is a window where reads may return data from
Verdict
Figma executed one of the most complex database migrations in the modern SaaS ecosystem and did so incrementally, reversibly, and without downtime for end users. The choice of application-layer routing over an external proxy is defensible and probably correct for the context — homogeneous stack, team with full code ownership, and need for native integration with the deploy cycle. What impresses me most about this case is not the technology — key-based sharding is a known pattern. What impresses is the execution discipline: the separation between logical and physical shards to facilitate future rebalancing, the consistency verification process before each cutover, and the incremental approach that allowed rollback at any point. Figma's original post is mandatory reading for any engineer working with data at scale. It does not sell a magic solution — it documents the hard work, the real trade-offs, and the problems that appeared along the way. That is rare and valuable. For engineers evaluating similar architectures: horizontal Postgres sharding is viable without external tooling, but the operational cost is permanent and scales with the number of shards. Evaluate Citus or managed