# Coinbase (2026): The AWS MSK Control Plane That Froze Trading

On May 7, 2026, a defect in the Amazon MSK control plane prevented automatic partition leader re-election across two of Coinbase's managed Kafka clusters, silently blocking fee, quoting, and trade execution services for hours. The incident exposes the hidden risks of relying on managed services as single coordination points — and the critical need for deep observability into infrastructure you don't operate.

- URL: https://fernando.moretes.com/studies/coinbase-aws-msk-2026

- Markdown: https://fernando.moretes.com/studies/coinbase-aws-msk-2026/study.md?lang=en

- Type: Post-mortem

- Company: Coinbase

- Domain: Data/Resilience

- Date: 2026-05-07

- Tags: postmortem, kafka, aws-msk, resilience, coinbase, event-driven, observability, managed-services

- Reading time: 11 min

---

## Incident Facts

- **Company:** Coinbase
- **Date:** May 7, 2026
- **Total impact duration:** Several hours (severe degradation followed by gradual recovery)
- **Affected service:** Amazon MSK (managed Kafka) — two production clusters
- **Business impact:** Near-total Coinbase trading halt: fee, quoting, and order execution services blocked
- **Failure mode:** Silent failure — clusters stuck in 'healing' state, no automatic partition leader re-election
- **Root cause:** Defect in the AWS MSK control plane that blocked Kafka leader re-election
- **Relevant stack:** AWS MSK (managed Kafka), internal fee/quoting/trades services, AWS (undisclosed region)
- **Classification:** Severity 1 incident — critical impact on revenue and customers

Kafka didn't crash. The brokers were up. The cluster health metrics looked reasonable. And yet, nearly all of Coinbase's trading ground to a halt — because the Amazon MSK control plane, the layer Coinbase doesn't operate and can't debug directly, entered a defective state that silently prevented partition leader re-election. This is the kind of failure that doesn't show up on your application dashboards until it's too late: managed infrastructure that stops working without telling you why.

## What Happened

On the morning of May 7, 2026, Coinbase began observing severe degradation across its trading services. It wasn't an abrupt crash — it was a progressive erosion that made initial diagnosis confusing. The **fee calculation**, **quoting**, and **order execution** services began failing or hanging, waiting for responses that never arrived. The cause wasn't obvious: Kafka brokers were responding, network connections were active, and producers could publish messages to some partitions.

The problem was at a level below what Coinbase's engineering teams could directly observe: the **Amazon MSK control plane**. This component — operated exclusively by AWS — is responsible for orchestrating maintenance operations, partition rebalancing, and critically, **partition leader re-election** when a leader broker becomes unavailable or needs to be replaced.

A defect in this control plane placed two production MSK clusters into a state internally referred to as **'healing'** — a maintenance state from which they could not exit. In this state, MSK could not complete leader re-election for the affected partitions. The practical result: **partitions with no active leader**. Producers attempting to write to those partitions received errors or hung. Consumers couldn't advance their offsets. Downstream services — which depend on messages in those topics to calculate fees, generate quotes, and process orders — stopped functioning coherently.

What makes this incident particularly insidious is the **silent failure mode**. There was no explicit crash, no obvious connection error, no 'cluster down' alert. The clusters technically existed and responded to some operations. The failure was in one specific coordination operation — leader re-election — that the managed system was supposed to execute automatically and didn't. This kind of partial failure is the hardest to detect and diagnose, especially when the defective layer is outside your operational control.

## Timeline

1. **T+0 — MSK event begins** — A defect in the Amazon MSK control plane begins affecting two of Coinbase's production Kafka clusters. The clusters enter the 'healing' state. Partition leader re-election is silently blocked.

2. **T+few minutes — First symptoms in services** — Fee calculation, quoting, and trade execution services begin showing elevated latency and timeouts. Kafka producers on leaderless partitions start failing or hanging. Initial alerts are ambiguous — they look like application issues or network latency.

3. **T+X — Initial misdiagnosis** — Teams investigate application services and network infrastructure. Kafka brokers appear healthy in standard metrics. The absence of leaders on specific partitions is not immediately visible in existing dashboards. Diagnosis time is extended by the lack of granular observability into partition leader state.

4. **T+Y — MSK identified as root cause** — The team identifies that affected partitions have no active leaders and that clusters are stuck in the 'healing' state. The issue is escalated to AWS. It is confirmed that the defect is in the MSK control plane, outside Coinbase's control.

5. **T+Z — AWS intervention and recovery begins** — AWS intervenes in the control plane to remediate the defect. Clusters begin exiting the 'healing' state. Partition leader re-election resumes. Coinbase services begin recovering gradually as topics regain active leaders.

6. **T+hours — Full recovery** — All trading services are restored. The team begins post-mortem analysis. The incident is classified as Severity 1 with critical impact on revenue and customer experience.

## Failure Flow: MSK Control Plane and Trading Services

This diagram reconstructs the failure flow of the May 7, 2026 incident. The MSK control plane (operated by AWS, outside Coinbase's control) entered a defective state, blocking partition leader re-election on the two affected clusters. Trading services that depended on those partitions lost the ability to produce or consume messages reliably.

### ☁️ AWS MSK: Control Plane (AWS-operated)

- MSK Control Plane, AWS-managed (external)
- MSK Cluster A, state: HEALING 🔴 (messaging)
- MSK Cluster B, state: HEALING 🔴 (messaging)
- Leaderless partitions, leader re-election blocked (data)

### 🏦 Coinbase: Trading Services

- Fee Service, ❌ blocked (compute)
- Quoting Service, ❌ blocked (compute)
- Trade Execution, ❌ blocked (compute)
- Kafka Producers, timeout / error (compute)
- Kafka Consumers, offset stalled (compute)

### 👤 End Users

- Coinbase customers, trading unavailable (user)

### Flows

- msk-cp -> msk-cluster-a: orchestrates (defective)
- msk-cp -> msk-cluster-b: orchestrates (defective)
- msk-cluster-a -> partition-leaderless: re-election blocked
- msk-cluster-b -> partition-leaderless: re-election blocked
- kafka-producer -> partition-leaderless: write → error/timeout
- partition-leaderless -> kafka-consumer: consume → stalled
- fee-svc -> kafka-producer: publishes events
- quote-svc -> kafka-producer: publishes events
- trade-svc -> kafka-producer: publishes events
- kafka-consumer -> fee-svc: consumes results
- kafka-consumer -> trade-svc: consumes results
- end-user -> quote-svc: requests quote
- end-user -> trade-svc: submits order

> **Root Cause: Defect in the AWS MSK Control Plane:** A defect in the Amazon MSK control plane — the orchestration layer operated exclusively by AWS — prevented two production Kafka clusters from completing automatic partition leader re-election. The clusters became stuck in the internal 'healing' state, a maintenance state from which they could not exit without external AWS intervention. The result was that multiple partitions had no active leader, silently blocking producers and consumers — no explicit crash, no 'cluster down' alert, no obvious connectivity error. The failure was in a coordination operation that was supposed to be transparent and automatic, but failed in a partial and opaque manner.

## Blast Radius and Why It Was So Wide

The impact of this incident was disproportionately wide for a failure that technically affected 'only' the partition leader re-election mechanism in two Kafka clusters. To understand why, you need to understand the architectural role Kafka plays at Coinbase.

In financial event-driven architectures, Kafka is not just a messaging system — it is the **coordination backbone** between services. The **fee calculation** services need to publish and consume events to calculate fees in real time. The **quoting service** depends on market data streams and internal state to generate accurate quotes. **Order execution** is a pipeline of sequential events where each step depends on the previous one having been confirmed via Kafka. When the partitions responsible for these flows have no leader, **there is no graceful degradation — there is a stop**.

The second factor that amplified the blast radius was the **absence of effective circuit breakers and fallbacks** in the services that depended on those partitions. Instead of failing fast and communicating the unavailability state to clients, services hung — producers blocked waiting for write acknowledgment, consumers stalled waiting for messages that never arrived. This **hang** behavior instead of **fail-fast** is particularly damaging: it consumes resources (threads, connections, memory), propagates unavailability to upstream services waiting for responses, and makes diagnosis harder because the system looks 'busy' rather than 'broken'.

The third factor was **dependency concentration**: two MSK clusters, both affected by the same control plane defect, covering the critical trading topics. There was no path redundancy — if the clusters were in 'healing' state, the services that depended on them simply stopped. The architecture had no **graceful degradation** mechanism for the specific scenario of 'Kafka available but partitions without leaders'.

## Remediation: What Coinbase Did and What Needs to Change

Immediate remediation was necessarily AWS-dependent: Coinbase has no access to the MSK control plane, so resolution required AWS to identify and fix the defect internally. That alone is an important architectural lesson — when your recovery plan depends on a third party acting, your MTTR is outside your control.

In the post-mortem, Coinbase identified several lines of action to prevent a similar incident from having the same impact:

**Partition and leader health observability.** The incident lasted longer than it should have in part because existing dashboards lacked granular visibility into partition leader state. Metrics like `UnderReplicatedPartitions`, `OfflinePartitionsCount`, and `ActiveControllerCount` are standard Kafka metrics that should be in first-level alerts. The absence of leaders on specific partitions should generate an immediate alert, not be discovered during manual diagnosis.
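A minimal sketch of the kind of check that turns "partition without a leader" into a first-level alert. It assumes partition metadata has already been fetched from the cluster (for example through an admin client) and that a missing leader is reported as broker id `-1`, which is how Kafka client metadata commonly represents it; verify this for your client. `PartitionState`, `find_leaderless`, and the topic names are hypothetical, not taken from Coinbase's post-mortem.

```python
from dataclasses import dataclass

# Assumption: the client reports leader id -1 when a partition has no leader.
NO_LEADER = -1


@dataclass(frozen=True)
class PartitionState:
    topic: str
    partition: int
    leader: int  # broker id, or NO_LEADER


def find_leaderless(partitions, critical_topics=None):
    """Return partitions with no leader, optionally limited to critical topics."""
    return [
        p for p in partitions
        if p.leader == NO_LEADER
        and (critical_topics is None or p.topic in critical_topics)
    ]


# Hypothetical metadata snapshot; in practice this comes from the cluster.
snapshot = [
    PartitionState("fees", 0, leader=1),
    PartitionState("fees", 1, leader=NO_LEADER),
    PartitionState("quotes", 0, leader=NO_LEADER),
    PartitionState("audit", 0, leader=NO_LEADER),
]

leaderless = find_leaderless(snapshot, critical_topics={"fees", "quotes"})
for p in leaderless:
    print(f"ALERT leaderless partition: {p.topic}[{p.partition}]")
```

Filtering by critical topics keeps the page-worthy signal separate from lower-priority topics, which can go to a lower-severity channel.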

**Timeouts and circuit breakers on Kafka producers and consumers.** Producers that hang indefinitely waiting for write acknowledgment on a leaderless partition are an anti-pattern. Settings like `request.timeout.ms`, `delivery.timeout.ms`, and `max.block.ms` need to be calibrated to ensure fail-fast behavior. On the consumer side, absence of offset progress for a defined period should trigger alerts and ideally activate circuit breaker logic in dependent services.
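One way to keep these settings honest is to validate them against an explicit latency budget at startup. The sketch below uses the Java producer's property names; the concrete values, the `check_timeouts` helper, and the 10-second budget are illustrative assumptions, not recommendations from the post-mortem. It relies on two documented behaviors of the Java client: `delivery.timeout.ms` must be at least `linger.ms + request.timeout.ms`, and a `send()` call can block for up to `max.block.ms` before the delivery timeout starts counting.

```python
# Illustrative fail-fast values (assumptions); tune them to your own latency budget.
FAIL_FAST_PRODUCER = {
    "max.block.ms": 2_000,         # max time send() may block on metadata or buffer space
    "request.timeout.ms": 1_500,   # max wait for a broker response to one request
    "linger.ms": 5,
    "delivery.timeout.ms": 5_000,  # upper bound on reporting success or failure after send()
}

REQUIRED = ("max.block.ms", "request.timeout.ms", "delivery.timeout.ms")


def check_timeouts(cfg, budget_ms):
    """Return a list of problems; an empty list means the config fits the budget."""
    missing = [key for key in REQUIRED if key not in cfg]
    if missing:
        return [f"{key} is not set explicitly" for key in missing]

    problems = []
    floor = cfg.get("linger.ms", 0) + cfg["request.timeout.ms"]
    if cfg["delivery.timeout.ms"] < floor:
        problems.append(f"delivery.timeout.ms must be >= {floor} (linger + request timeout)")

    # Worst case for one record: block in send(), then wait out the delivery timeout.
    worst_case = cfg["max.block.ms"] + cfg["delivery.timeout.ms"]
    if worst_case > budget_ms:
        problems.append(f"worst-case send takes {worst_case} ms, budget is {budget_ms} ms")
    return problems


print(check_timeouts(FAIL_FAST_PRODUCER, budget_ms=10_000))
```

Running the same check against a config that relies on client defaults makes the gap between "default" and "fail-fast" visible before an incident does.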

**Graceful degradation strategy for Kafka failures.** For critical services like quoting and fee calculation, the question 'what do we do when Kafka is unavailable?' needs an explicit architectural answer. This can include: fallback to direct synchronous calculation, fast rejection with a clear error message to the client, or a degraded operating mode with reduced functionality. The worst scenario — which is what happened — is having none of these answers and letting services hang.
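The "fallback to direct synchronous calculation" option can be sketched as a deadline wrapper around the Kafka-backed path. Everything here is hypothetical: `call_with_deadline`, the two quote functions, and the prices are stand-ins, and the post-mortem does not describe Coinbase's actual fallback design.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)


def call_with_deadline(primary, fallback, deadline_s):
    """Run primary(); if it fails or misses the deadline, use fallback() instead."""
    future = _pool.submit(primary)
    try:
        return future.result(timeout=deadline_s), "primary"
    except FutureTimeout:
        # cancel() only helps if the task has not started; a running thread
        # cannot be killed, so the primary path still needs its own timeouts.
        future.cancel()
        return fallback(), "fallback"
    except Exception:
        return fallback(), "fallback"


def quote_via_kafka():
    time.sleep(0.5)  # simulates a producer stuck on a leaderless partition
    return 101.25


def quote_synchronously():
    return 101.0  # slower or less precise path, but it answers


price, path = call_with_deadline(quote_via_kafka, quote_synchronously, deadline_s=0.05)
print(price, path)
```

The wrapper bounds how long the caller waits, not how long the stuck work runs, so it complements producer-level timeouts rather than replacing them.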

**Reducing managed service dependency as a single coordination point.** This is the hardest change and the most important one. I'm not saying to abandon MSK; managed services have real value. But the architecture needs to acknowledge that any managed service can fail in ways you don't control and can't remediate directly. This means: multiple clusters with critical topics replicated (ideally across control plane boundaries, since clusters that share one control plane can fail together), routing capability between clusters, and regular failover tests that include the scenario of 'cluster in an unrecoverable maintenance state'.

## Incident Lessons

- **Managed services have opaque control planes.** MSK is Kafka, but it's not *your* Kafka. The control plane that manages leader re-election, rebalancing, and maintenance is operated by AWS. Defects in that layer are invisible to you and unremediable without AWS intervention. Your architecture needs to be resilient to that opacity.
- **Silent failure is worse than an explicit crash.** A 'healing' cluster that doesn't progress is harder to diagnose than a cluster that refuses connections. Invest in observability that detects absence of progress, not just presence of errors.
- **`UnderReplicatedPartitions` and `OfflinePartitionsCount` are first-level alert metrics.** If you use Kafka in production for critical flows and these metrics are not in your on-call runbook, fix that now.
- **Kafka producers without configured timeouts are time bombs.** `max.block.ms`, `request.timeout.ms`, and `delivery.timeout.ms` must be explicitly configured on all producers for critical services. The Java client's defaults (60 seconds, 30 seconds, and 2 minutes respectively) are too permissive for financial systems.
- **Circuit breakers need to cover the 'Kafka slow/stuck' scenario, not just 'Kafka down'.** Most circuit breaker implementations detect connection errors. Few detect 'producer blocked for more than N seconds'. That second scenario is what happened here.
- **Dependency on two clusters under the same control plane is a logical single point of failure.** Even if they are 'different' clusters, if both are MSK in the same account/region and the control plane fails, both are affected simultaneously. Real diversification requires control plane isolation.
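The "slow/stuck" circuit breaker from the lessons above can be sketched as a breaker that counts slow calls as failures, even when they eventually succeed. `SlowCallBreaker` and its thresholds are hypothetical, and the clock is injectable so the behavior can be exercised without waiting.

```python
import time


class SlowCallBreaker:
    """Opens after `trip_after` consecutive calls that failed or were too slow."""

    def __init__(self, slow_after_s, trip_after, reset_after_s, clock=time.monotonic):
        self.slow_after_s = slow_after_s
        self.trip_after = trip_after
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.bad_streak = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self):
        """True if a call may proceed (closed, or open long enough to allow a probe)."""
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.reset_after_s

    def record(self, duration_s, ok=True):
        """Report the outcome of a call; a slow success counts as a bad call."""
        if ok and duration_s <= self.slow_after_s:
            self.bad_streak = 0
            self.opened_at = None
            return
        self.bad_streak += 1
        if self.bad_streak >= self.trip_after:
            self.opened_at = self.clock()


now = [0.0]  # fake clock for the demonstration
breaker = SlowCallBreaker(slow_after_s=1.0, trip_after=3, reset_after_s=30.0,
                          clock=lambda: now[0])

for _ in range(3):
    breaker.record(duration_s=5.0)  # sends that succeeded, but took 5 seconds

allowed_while_open = breaker.allow()
now[0] = 31.0
allowed_for_probe = breaker.allow()
breaker.record(duration_s=0.1)      # the probe call was fast again
allowed_after_recovery = breaker.allow()
print(allowed_while_open, allowed_for_probe, allowed_after_recovery)
```

Callers check `allow()` before producing and reject fast when it returns `False`, which converts a hang into an explicit, observable error.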

> **My Senior Take:** This incident bothers me in a specific way: it's exactly the kind of failure that experienced Kafka teams don't anticipate, because self-managed Kafka rarely fails *this way*. When you operate your own Kafka, you have visibility into ZooKeeper or the KRaft controller, you can intervene directly, you can force a leader re-election. With MSK, you've outsourced that layer — and with it, the ability to act when it fails.

What would I do differently in the architecture? Three concrete things. **First**, I would never leave critical trading services with a direct, synchronous dependency on a single Kafka path without an explicit fallback mechanism. For fee and quoting, there is almost always a slower but functional synchronous calculation path — that path needs to exist and be tested regularly. **Second**, I would implement a dedicated 'Kafka health watchdog': a simple process that attempts to produce and consume on a heartbeat topic every 30 seconds and alerts if the round-trip exceeds a threshold. I don't trust infrastructure metrics to detect partial failures — I want end-to-end liveness proof. **Third**, and most importantly: I would test the 'MSK in an unrecoverable maintenance state' scenario in my annual game day. Not 'MSK down' — that's easy. 'MSK responding but no leaders on critical partitions'. That specific scenario is what caught Coinbase off guard, and it's what will catch the next company too if they don't test for it explicitly.

On the decision to use MSK: I don't think it was a mistake. Self-managed Kafka at Coinbase's scale has its own significant operational risks. The mistake was not explicitly modeling the risk of opaque control plane failure and not building the corresponding defenses. Managed services don't eliminate risk — they trade one set of risks for another. Your architecture needs to reflect that trade-off.
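The heartbeat watchdog described above can be sketched as a single round-trip probe. The `produce` and `consume` callables are placeholders for whatever Kafka client is in use, and the in-memory queues below only simulate a healthy and a stuck topic; the scheduling loop (for example, every 30 seconds) and the alerting hook are left out.

```python
import queue
import time
import uuid


def heartbeat_round_trip(produce, consume, timeout_s, clock=time.monotonic):
    """Send one heartbeat and wait for it; return seconds taken, or None on failure.

    produce(token) publishes the token; consume(poll_timeout_s) returns the next
    value from the heartbeat topic, or None if nothing arrived in time.
    """
    token = uuid.uuid4().hex
    start = clock()
    try:
        produce(token)
        while clock() - start < timeout_s:
            if consume(0.05) == token:  # ignore stale heartbeats from earlier rounds
                return clock() - start
    except Exception:
        return None
    return None


def consume_from(q):
    """Adapt a queue.Queue to the consume(poll_timeout_s) shape used above."""
    def consume(poll_timeout_s):
        try:
            return q.get(timeout=poll_timeout_s)
        except queue.Empty:
            return None
    return consume


healthy_topic = queue.Queue()
rtt_ok = heartbeat_round_trip(healthy_topic.put, consume_from(healthy_topic), timeout_s=1.0)

stuck_topic = queue.Queue()  # simulates a leaderless partition: writes vanish
rtt_stuck = heartbeat_round_trip(lambda token: None, consume_from(stuck_topic), timeout_s=0.2)

print("healthy round trip:", rtt_ok)
print("stuck round trip:", rtt_stuck)
```

A `None` result, or a round trip above the chosen threshold, is the end-to-end liveness signal; it does not depend on broker or control plane metrics being accurate.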

## Verdict

The May 7, 2026 Coinbase incident is a case study in the **hidden cost of abstraction in critical infrastructure**. Amazon MSK delivered what it promised most of the time — managed Kafka, without the operational burden of running brokers. What wasn't in the implicit contract was: 'and when our control plane fails opaquely, you won't be able to do anything until we intervene'.

There is no villain here. MSK failed in a way that AWS probably also didn't anticipate — defects in managed service control planes are rare and generally fixed quickly. The problem is not that MSK failed. The problem is that Coinbase's architecture was not prepared for the scenario of 'Kafka partially available with leaderless partitions' — a scenario that is harder to detect, harder to diagnose, and harder to remediate than a total failure.

The central lesson is not 'don't use managed services'. It is: **when you use a managed service for a critical coordination function, you need to build observability and resilience for the specific failure modes of that managed service** — including the failure modes that are opaque to you. This means partition health metrics in first-level alerts, aggressive timeouts on producers and consumers, circuit breakers that detect slowness beyond errors, fallbacks for critical paths, and game days that explicitly test the scenario of 'managed service in an unrecoverable maintenance state'.

Coinbase has the engineering maturity to implement all of these defenses — and the public post-mortem demonstrates the accountability culture needed to do so. The cost was high. The lessons are transferable to any team using managed Kafka — or any managed service — in critical business flows.

## References

- [Coinbase — A postmortem of our May 7, 2026 outage](https://www.coinbase.com/blog/a-postmortem-of-our-may-7-2026-outage)

