Studies
TeardownDiscordDados

Discord: How to Store Trillions of Messages — Cassandra → ScyllaDB Migration Teardown

Mar 6, 2023 7 min AI-assisted
Share:

Listen to study

generated on play

Generated only on first play

On demand
0:000:00
Speed
The MP3 is saved to S3 after the first play.

Discord migrated its message storage from Apache Cassandra to ScyllaDB, eliminating unpredictable tail latencies and GC pauses that affected millions of users. This teardown reconstructs the architecture, examines the engineering decisions and trade-offs involved, and presents my critical read of what was done well — and what I would do differently.

When you have trillions of messages and tail latency that breaks user experience, swapping the database is not a weekend decision. Discord did exactly that — and the story of how they moved from Cassandra to ScyllaDB, rewrote data services in Rust, and eliminated GC pauses lasting seconds is one of the most instructive data engineering cases of the last decade.

Case Facts

Company
Discord Inc.
Domain
Real-time communication platform
Data scale
Trillions of messages stored; billions of inserts per day (estimate)
Previous stack
Apache Cassandra (JVM), Python/Go data services
Current stack
ScyllaDB (C++), data services in Rust
Original post publication
2023 (Discord official blog)
Main problem solved
Tail latency (p99/p999) and GC pauses in Cassandra causing user-perceptible degradation
Reported outcome
Dramatic tail latency reduction; smaller clusters; more predictable operation

The Problem: Cassandra Worked, But Not Well Enough

Discord adopted Apache Cassandra early, and the choice made sense: wide-column data model, horizontal scalability without a single point of failure, and a mature ecosystem. For a messaging platform where each Discord server (guild) has channels, and each channel has a message history that needs to be read efficiently by timestamp, Cassandra's partition model is almost perfect in theory. You partition by (channel_id, bucket) and cluster by message_id (Snowflake, monotonically increasing), which gives time-range reads within a single partition. Elegant.

The problem wasn't the data model. It was the JVM.

Cassandra runs on the JVM and, with large heaps — necessary to keep indexes and caches in memory for trillions of messages — the garbage collector becomes the enemy. GC pauses of seconds are not hypothetical at this scale; they are routine. Discord reported completely unpredictable p99 and p999 latencies, with spikes reaching seconds on operations that should take milliseconds. On a real-time communication platform, this is perceptible. Users see message history take a long time to load. The experience breaks.

There is also the compaction problem. Cassandra uses LSM trees (Log-Structured Merge-tree) internally, and the compaction process — necessary to maintain read performance over time — competes with production operations for I/O and CPU. In large clusters under continuous load, compaction windows become operational risk events. The Discord team described hot nodes causing tail pressure on the entire cluster, because Cassandra has no per-operation resource isolation — a slow node contaminates the cluster's latency percentile.

The obvious alternative would be to tune the GC, use ZGC or Shenandoah, reduce the heap, fragment the clusters. The team tried these approaches. The problem is that they buy time, they don't solve the root cause: you are running a mission-critical database on top of a runtime with non-deterministic garbage collection. At some point, tuning engineering becomes accumulated technical debt, and the cost-benefit ratio inverts.

Reconstructed Architecture: Message Read/Write Path

Diagram reconstructed based on Discord's official post. Shows the data flow from client to storage, including the Rust data services layer and ScyllaDB as primary storage.

👤 Clients
  • Discord Client · (Web / Mobile / Desktop)
🌐 Edge & Gateway
  • Gateway Service · (WebSocket)
  • API Service · (REST / Internal RPC)
⚙️ Data Services (Rust)
  • Message Data Service · (Rust)
  • Hot Message Cache · (in-process / Redis)
🗄️ Storage Layer
  • ScyllaDB Cluster · (C++, Shard-per-core)
  • Cassandra Cluster · (Legacy / Migration Source)
🔄 Migration Path
  • Data Migrator · (Dual-write / Backfill)

How the System Works: Data Model, ScyllaDB, and the Rust Layer

The core data model remained essentially the same after the migration — and that is an important point. Discord did not need to redesign the schema to move from Cassandra to ScyllaDB, because ScyllaDB is compatible with the CQL (Cassandra Query Language) protocol. The messages table uses (channel_id, bucket) as the partition key, where bucket is derived from the message timestamp to avoid unbounded partitions (an active channel over years would have a gigantic partition without bucketing). The clustering key is message_id, a Snowflake ID that encodes the timestamp, guaranteeing native chronological ordering and efficient range reads.

What changed was the engine underneath. ScyllaDB is a reimplementation of Cassandra in C++ with the Seastar framework, which uses a shard-per-core model: each CPU core has its own data set and its own I/O queue, with no shared memory between cores. This eliminates the lock contention that is endemic in multi-threaded systems with a shared heap. There is no GC. Memory management is deterministic. The practical result is that tail latencies — p99, p999 — collapse to much smaller values and, more importantly, much more predictable ones.

The rewrite of data services in Rust was complementary and equally deliberate. The previous Python/Go services introduced their own latency variability: Go's GC, while much better than the JVM's, still introduces pauses; Python has the GIL. Rust offers deterministic performance without a GC runtime, with compile-time memory safety. For a service that is on the critical path of every message read and write, the choice makes clear technical sense. The cost is the learning curve and slower development velocity — a trade-off Discord evaluated as acceptable given the load profile.

The migration itself was conducted with dual-write and backfill. During the transition phase, new messages were written to both Cassandra and ScyllaDB. A background migration process read historical data from Cassandra and wrote it to ScyllaDB. When data parity was confirmed, read traffic was gradually shifted to ScyllaDB. This approach is conservative and correct for mission-critical data — you never do a big-bang migration on production message storage. The risk of data loss or inconsistency is too high. The cost of dual-write (temporarily doubling write load) is a reasonable price for operational safety.

Decision Matrix: Storage Options Evaluated

Stay with Cassandra (aggressive tuning)

Pros
  • Zero migration cost
  • Team already knows the system
  • Mature ecosystem, broad tooling
Cons
  • Non-deterministic GC is root cause, not symptom
  • Tuning becomes accumulated technical debt
  • Hot nodes and compaction remain operational risks

Rejected: treats symptom, not cause

ScyllaDB (C++, shard-per-core)

Pros
  • CQL-compatible — no schema change
  • No GC: deterministic tail latency
  • Shard-per-core eliminates inter-thread contention
  • Smaller cluster footprint for same load (estimate)
Cons
  • Migration of trillions of records is a high-risk operation
  • Smaller ecosystem than Cassandra
  • Smaller vendor — long-term support risk

Accepted: solves root cause with protocol compatibility

PostgreSQL / CockroachDB (distributed SQL)

Pros
  • Full ACID, flexible queries
  • Excellent ecosystem and tooling
Cons
  • Relational data model is not natural for message wide-column
  • Horizontal scalability more complex at this scale
  • Complete rewrite of schema and access patterns

Rejected: high migration cost without clear gain for this use case

DynamoDB / Bigtable (cloud-native)

Pros
  • Managed operation, no cluster overhead
  • Virtually unlimited scalability
Cons
  • Significant cloud lock-in
  • Cost at trillions of records scale can be prohibitive
  • Less control over tail performance

Not publicly evaluated — lock-in and cost are real barriers at this scale

The Rust Decision: Determinism as a System Requirement

The rewrite of data services in Rust deserves separate analysis because it reveals an important engineering philosophy: when you have an aggressive tail latency requirement, every layer of the stack needs to be evaluated as a potential source of jitter. Discord didn't just swap the database — they audited the critical path end-to-end and identified that the service layer was also a source of variability.

The technical argument for Rust in this context is solid. Rust's ownership model guarantees no runtime garbage collection — allocations and deallocations are deterministic and predictable. For a service processing millions of read and write operations per second, the absence of GC pauses — even the sub-millisecond ones from Go — translates to lower and more stable latency percentiles. Additionally, Rust has performance close to C/C++ without the memory safety risks that would make C++ impractical for a product team.

The real cost of Rust is development velocity and the size of the engineer pool. Rust has a significant learning curve — the borrow checker is a real obstacle for developers coming from GC languages. For a company of Discord's size, this means the base of engineers who can contribute to these services is smaller, and the onboarding time for new engineers is longer. This is a trade-off that makes sense for low-level infrastructure services with extreme performance requirements, but would be questionable for high-level business services where iteration speed matters more than tail latency.

What Discord did correctly was to limit Rust usage to the critical path — the data services that sit directly in front of the database. They did not rewrite everything in Rust. This containment is mature: you apply the most expensive tool (in terms of development cost) where the return is highest.

AWS Well-Architected Framework Read

Security

The post does not detail security controls, but the architecture implies network isolation between layers (data services not directly exposed), and ScyllaDB's shard-per-core model reduces the attack surface of shared processes. A blind spot: there is no mention of encryption in transit between data services and ScyllaDB, nor of granular access controls at the table level. In financial or regulated systems, this would be a critical gap.

Reliability

High. ScyllaDB's replication model (inherited from Cassandra) offers node failure tolerance without downtime. Migration via dual-write and gradual backfill is the correct pattern for zero downtime on critical data. The residual risk is the dual-write window: if there is data divergence between Cassandra and ScyllaDB during migration, reconciliation is complex. There is no mention of an explicit rollback strategy.

Performance efficiency

This is the central pillar of the case. The combination of ScyllaDB (no GC, shard-per-core) + Rust (no GC, zero-cost abstractions) eliminates the two main sources of latency jitter identified. The wide-column data model with timestamp bucketing is correct for the access pattern (range reads by channel). The use of a hot cache for recent messages reduces pressure on ScyllaDB for the most common access case.

Cost optimization

Discord reports smaller clusters for the same load after migration — ScyllaDB extracts more performance per node than Cassandra, which translates to less hardware. The migration cost (engineering, dual-write, parallel operation of two clusters) was significant, but is a one-time cost. The recurring operational cost should be lower. The Rust rewrite has a high engineering cost, but permanently reduces infrastructure cost.

Sustainability

Smaller clusters for the same load means lower energy consumption. ScyllaDB's per-node efficiency and the absence of GC (which wastes CPU on collection) reduce the computational footprint. It is not the focus of the post, but it is a real benefit.

FA
What I Would Do Differently
Senior Solutions Architect

The decision to migrate to ScyllaDB was correct and well-executed. But there are three areas where I would have made different choices or added layers that the post does not mention. 1. Explicit rollback strategy during migration. The post describes dual-write and backfill, but does not mention a formal rollback plan. In systems with trillions of records, the dual-write window can last weeks or months. During that period, you need an automated reconciliation process that detects divergences between Cassandra and ScyllaDB and a clear decision about which source is authoritative at each phase. I would implement a continuous validation pipeline that samples random partitions and compares checksums, with automatic alerts for divergences above a threshold. Without this, you are flying blind during the migration. 2. Tail observability as a first-class citizen. The original problem was unpredictable tail latency. After migration, you need instrumentation that proves the problem was solved and detects regressions before they affect users. This means latency histograms (not averages) exposed via OpenTelemetry, with alerts on p99 and p999 per operation and per ScyllaDB shard. The Rust services should have integrated distributed tracing from the first production deploy — not as a later addition. 3. Dynamic bucketing based on channel activity. The current schema uses fixed timestamp bucketing to avoid gigantic partitions. This works, but it is a static compromise. Channels with very high activity (large servers with thousands of messages per minute) and nearly inactive c

Verdict

The Discord case is an example of mature data engineering: identifying the correct root cause (GC non-determinism on the JVM), choosing a replacement with protocol compatibility to minimize migration risk (ScyllaDB + CQL), executing the migration conservatively (dual-write + gradual backfill), and extending the determinism principle to the service layer (Rust). Each decision has clear technical logic and trade-offs are explicitly acknowledged. What makes this case valuable as a study is not the specific technology — ScyllaDB may not be the right choice for other contexts — but the reasoning framework: when tail latency is a system requirement, you need to eliminate sources of non-determinism at every layer of the stack, and this often requires trading development convenience for runtime predictability. The combination of ScyllaDB + Rust is a deliberate bet in that direction. The most relevant residual risk that was not publicly addressed is the dependency on a smaller vendor (ScyllaDB) for critical infrastructure. Cassandra has the Apache Foundation and a huge ecosystem; ScyllaDB is a private company. For a platform of Discord's size, this is a long-term risk that deserves a cont

#discord#scylladb#cassandra#rust#data-platform#migration#wide-column#messaging
Share:
Written with AI assistance from the public case and my architect's reading.