# GitHub 2018: 43 Seconds of Partition, 24 Hours of MySQL Split-Brain

On October 21, 2018, a 43-second network maintenance window triggered a failure cascade that kept GitHub degraded for nearly 24 hours. Orchestrator promoted replicas in the wrong regions, creating a split-brain between datacenters — revealing how failover automation without proper fencing can turn a micro-failure into a data consistency catastrophe.

- URL: https://fernando.moretes.com/studies/github-2018-network-partition

- Markdown: https://fernando.moretes.com/studies/github-2018-network-partition/study.md?lang=en

- Type: Post-mortem

- Company: GitHub

- Domain: Data/Resilience

- Date: 2018-10-21

- Tags: postmortem, mysql, split-brain, orchestrator, resilience, data, failover, github

- Reading time: 10 min

---

## Incident Fact Sheet

- **Company / System:** GitHub — code hosting and collaboration platform
- **Incident Date:** October 21, 2018
- **Total Degradation Duration:** 24 hours and 11 minutes, the figure reported in GitHub's post-incident analysis
- **Initial Network Partition:** 43 seconds during network equipment maintenance
- **Impact:** Data inconsistency across regions, degraded reads and writes, multiple internal services affected (Issues, Pull Requests, notifications, webhooks)
- **Stack Involved:** MySQL (primary-replica topology), GitHub Orchestrator, ProxySQL, Memcached, GitHub Actions (predecessor), US East + US West datacenters
- **Root Cause:** Automatic replica promotion by Orchestrator during a transient network partition, without fencing the original primary — resulting in two active primaries simultaneously

Forty-three seconds. That was how long GitHub's network was partitioned during a planned maintenance window. There was no catastrophic hardware failure, no application bug, no attack. Just 43 seconds of silence between datacenters — enough time for the failover automation system to make an irreversible decision that would take nearly a full day to correct. This post-mortem analyzes how the interaction between well-intentioned automation, MySQL replication topology, and the absence of fencing mechanisms turned a micro-interruption into one of the most instructive incidents in recent platform engineering history.

## What Happened

On October 21, 2018, at 16:00 UTC, GitHub's network engineering team initiated the replacement of a network device in the primary datacenter located on the US East Coast. The procedure was routine — the kind of maintenance that happens dozens of times a year in large-scale infrastructure. During the swap, there was a 43-second connectivity interruption between the US East and US West datacenters.

This interval was sufficient for the **GitHub Orchestrator**, the MySQL topology management tool used by GitHub (which GitHub itself helped develop as an open source project), to interpret the loss of connectivity to the MySQL primary in US East as a real node failure. Following its automatic promotion logic, Orchestrator elected a replica in US West as the new primary and redirected write traffic to it.

The problem: the original primary in US East **had not failed**. It was perfectly operational — just temporarily isolated. When connectivity was restored after 43 seconds, the system found itself in a classic **split-brain state**: two MySQL nodes believing themselves to be the legitimate primary, both accepting writes, both silently diverging in terms of data.

From that moment on, every write arriving at the US East primary and every write arriving at the new US West primary created conflicting versions of reality. Issues created on one side didn't appear on the other. Pull requests updated in one region diverged from the other. Memcached, which serves as a caching layer on top of MySQL, began serving inconsistent data — because the underlying data was inconsistent.
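A toy model makes the divergence concrete. The sketch below is not MySQL: `ToyPrimary` and its auto-increment counter are hypothetical stand-ins, and the payloads are invented. It only illustrates how two nodes that start from the same state and accept writes independently end up with the same key pointing at different data.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPrimary:
    """Hypothetical stand-in for a primary with an auto-increment key."""
    name: str
    next_id: int
    rows: dict[int, str] = field(default_factory=dict)

    def insert(self, payload: str) -> int:
        row_id = self.next_id
        self.rows[row_id] = payload
        self.next_id += 1
        return row_id


def conflicting_ids(a: ToyPrimary, b: ToyPrimary) -> list[int]:
    """Return ids present on both sides with different payloads."""
    shared = a.rows.keys() & b.rows.keys()
    return sorted(i for i in shared if a.rows[i] != b.rows[i])


# Both nodes start from the same replicated state (ids 1..100 already used).
east = ToyPrimary("us-east", next_id=101)
west = ToyPrimary("us-west", next_id=101)

# After the promotion, each side accepts writes on its own.
east.insert("issue: login page broken")
west.insert("issue: typo in README")
west.insert("comment: fixed in #42")

print(conflicting_ids(east, west))  # [101]
```

Row 101 now means two different things depending on which region answers, and row 102 exists on only one side. Neither problem can be fixed by simply resuming replication.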

GitHub's team quickly detected that something was wrong — replication alerts fired within minutes. But the **fix was not trivial**. You can't simply "reconnect" two diverging primaries and expect MySQL to resolve conflicts. MySQL's binlog-based replication is unidirectional and has no native conflict resolution. To restore consistency, the team needed to identify the exact point of divergence in the binlogs, discard writes from the side being demoted, and rebuild the replication topology — all while the production site was still serving traffic.
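Finding the point of divergence can be pictured as a longest-common-prefix search over two ordered transaction histories. The sketch below assumes each side's history has already been exported as an ordered list of transaction identifiers. The identifiers and function names are hypothetical, and real binlog analysis involves far more than comparing strings.

```python
def divergence_index(east_log: list[str], west_log: list[str]) -> int:
    """Length of the longest common prefix of two ordered transaction logs."""
    common = 0
    for east_txn, west_txn in zip(east_log, west_log):
        if east_txn != west_txn:
            break
        common += 1
    return common


def split_histories(east_log: list[str], west_log: list[str]):
    """Return (shared, east_only, west_only) around the divergence point."""
    i = divergence_index(east_log, west_log)
    return east_log[:i], east_log[i:], west_log[i:]


# Hypothetical transaction identifiers, for illustration only.
east = ["t1", "t2", "t3", "e4", "e5"]
west = ["t1", "t2", "t3", "w4"]

shared, east_only, west_only = split_histories(east, west)
print(shared)     # ['t1', 't2', 't3']
print(east_only)  # ['e4', 'e5']
print(west_only)  # ['w4']
```

Everything in `shared` is safe on both sides. The two remaining lists are the writes that someone has to reconcile or discard by hand.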

## Incident Timeline

1. **16:00 UTC — Maintenance Begins** — Network team begins equipment replacement in the US East datacenter. Planned and approved procedure.

2. **16:00–16:01 UTC — 43-Second Partition** — Connectivity between US East and US West is interrupted for 43 seconds. The MySQL primary in US East becomes unreachable from US West.

3. **~16:01 UTC — Orchestrator Promotes Replica** — Orchestrator, operating from US West, detects primary failure and automatically promotes the most advanced replica in US West as the new primary. ProxySQL is reconfigured to direct writes to the new primary.

4. **~16:01 UTC — Connectivity Restored, Split-Brain Active** — Network is restored. There are now two active MySQL primaries: the original in US East (which never stopped) and the new one in US West. Both accept writes. Data divergence begins.

5. **~16:05 UTC — Alerts Fire** — MySQL replication alerts and inconsistencies in application metrics begin to surface. Engineers are paged. The incident is declared.

6. **~16:05–21:00 UTC — Assessment and Containment** — Teams assess the extent of divergence and analyze the binlogs to identify the point where the two histories split. Write traffic is gradually redirected, and the decision is made to use US East as the source of truth and discard conflicting writes from US West.

7. **21:00 UTC — Controlled Recovery Begins** — Resynchronization process begins. Replicas are rebuilt from the US East primary. Services are gradually restored as consistency is verified cluster by cluster.

8. **~15:00 UTC (Oct 22) — Full Recovery** — All MySQL systems return to consistency. Replication topology is fully restored. Monitoring confirms no additional divergences. Incident closed.

## Failure Flow: Cross-Datacenter Split-Brain

The outline below reconstructs the system state during the split-brain. The 43-second partition caused Orchestrator (in US West) to promote a local replica while the original primary in US East remained operational, leaving two primaries accepting writes at the same time.

### 🌐 Clients

- GitHub Users Web + Git (user)

### 🔀 Routing Layer

- ProxySQL US East (network)
- ProxySQL US West (network)

### 🗄️ US East: Primary DC

- MySQL Primary US East ✅ (never failed) (data)
- MySQL Replica US East R1 (data)
- MySQL Replica US East R2 (data)

### 🗄️ US West: Secondary DC

- MySQL Promoted US West ⚠️ (improperly promoted new primary) (data)
- MySQL Replica US West R1 (data)
- Orchestrator US West (compute)

### ⚡ Cache Layer

- Memcached (inconsistent data) (storage)

### Flows

- users -> proxysql_e: writes/reads
- users -> proxysql_w: writes/reads
- proxysql_e -> mysql_primary_e: write (before)
- proxysql_w -> mysql_promoted: write (after promotion)
- mysql_primary_e -> replica_e1: replication
- mysql_primary_e -> replica_e2: replication
- mysql_promoted -> replica_w1: replication
- orchestrator -> mysql_promoted: promoted during partition
- mysql_primary_e -> mysql_promoted: ⚡ 43s partition
- mysql_primary_e -> memcached: cache invalidation
- mysql_promoted -> memcached: cache invalidation (conflict)

> **Root Cause: Failover Automation Without Fencing:** The root cause was not the network partition itself — 43-second partitions are expected in large-scale distributed systems. The root cause was the **absence of a fencing mechanism (STONITH — Shoot The Other Node In The Head)** prior to automatic promotion. Orchestrator promoted a replica without first ensuring the original primary was truly dead and incapable of accepting writes. In database systems, the rule is absolute: **you never promote a new primary without being certain the old one has been isolated**. Automation that acts faster than the time needed to confirm the previous node's death is automation that can create split-brain. Failover speed and consistency safety are opposing forces — and in this case, speed won destructively.

## The Complexity of Remediation

When GitHub's team confirmed the split-brain, the first critical decision was: **which side of the divergence to preserve?** This is not a trivial question. In a system with millions of active users, both sides of the split-brain had legitimate writes from real users. Choosing US East as the source of truth meant discarding writes that had happened in US West after the promotion — writes that users had made in good faith and expected to be persisted.

The team decided to use **US East as the source of truth**, based on the fact that it was the original primary and had the longest and most reliable replication history. Conflicting writes from US West were identified through binlog analysis — a manual and meticulous process of comparing transaction logs from both sides to find the exact point where the histories diverged.

The recovery process was deliberately slow and cautious. The team couldn't simply run `CHANGE MASTER TO` and hope for the best. Each MySQL cluster needed to be verified individually. Memcached needed to be completely invalidated to prevent stale data from continuing to be served after resynchronization. Services that depended on eventual consistency needed to be paused or placed in read-only mode while the topology was rebuilt.

An important detail the official post-mortem reveals: during the recovery process, the team **disabled Orchestrator** to prevent it from making further automatic decisions while the topology was in an inconsistent state. This is an implicit acknowledgment that automation, at that moment, was an additional risk, not an aid. The recovery was essentially manual — experienced engineers navigating MySQL binlogs at 2 AM, making surgical decisions about which transactions to preserve.

The total cost was approximately 24 hours of service degradation, with varying impact across different features. Some features were completely unavailable; others operated in degraded mode with the possibility of inconsistency. GitHub's public communication during the incident was exemplary — frequent updates on the status page, transparency about the nature of the problem, and a detailed post-mortem published 9 days later.

## Why Orchestrator Acted This Way — and Why It Makes Sense

It's important not to demonize Orchestrator. It did exactly what it was configured to do: detect primary unavailability and promote the best available replica. In the large majority of real failure scenarios (the primary actually died, a disk failed, a process hung), this behavior is correct and spares the on-call team a manual failover.

The problem is that Orchestrator, like any automatic failover system, operates under an **implicit assumption**: if it can't reach the primary, the primary is dead. This assumption is reasonable but not universal. Transient network partitions — especially during planned maintenance — create exactly the scenario where the assumption fails.

The technical solution to this problem has been known for decades in distributed systems literature: **quorum-based fencing**. Before promoting a replica, the failover system must obtain confirmation from a majority of nodes (or an external arbiter) that the primary is truly unreachable — and, ideally, must execute a fencing mechanism to ensure the old primary cannot accept writes even if it comes back online.
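The quorum half of that idea can be sketched as a strict-majority vote among independent observers. The `Observation` enum, the observer names, and the rule that silent observers count against the quorum are all assumptions made for illustration, not a description of how Orchestrator decides.

```python
from enum import Enum


class Observation(Enum):
    REACHABLE = "reachable"
    UNREACHABLE = "unreachable"
    NO_REPORT = "no_report"  # the observer itself could not be contacted


def primary_confirmed_down(observations: dict[str, Observation]) -> bool:
    """True only if a strict majority of ALL observers report the primary unreachable.

    Observers that did not report count against the quorum, so an isolated
    minority cannot declare the primary dead on its own.
    """
    total = len(observations)
    unreachable = sum(
        1 for o in observations.values() if o is Observation.UNREACHABLE
    )
    return unreachable > total // 2


# Hypothetical observers: two in US West see a failure, three are silent.
isolated_minority = {
    "west-a": Observation.UNREACHABLE,
    "west-b": Observation.UNREACHABLE,
    "east-a": Observation.NO_REPORT,
    "east-b": Observation.NO_REPORT,
    "cloud-a": Observation.NO_REPORT,
}
print(primary_confirmed_down(isolated_minority))  # False
```

A majority vote only establishes that most observers cannot reach the primary. It does not prove the primary has stopped accepting writes, which is why the vote needs to be paired with an actual fencing step.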

In the MySQL context, this can be implemented in several ways:
- **STONITH via IPMI/iDRAC**: power off the old primary's host through its out-of-band management interface before promotion
- **VIP (Virtual IP) revocation**: remove the virtual IP from the old primary before assigning it to the new one
- **Semi-synchronous replication with `rpl_semi_sync_master_wait_for_slave_count`**: acknowledge a write only after at least N replicas have received it. This is not fencing in itself, but it bounds how many acknowledged writes can exist only on an isolated primary
- **MySQL Group Replication or InnoDB Cluster**: topologies with built-in, Paxos-based group consensus, in which a member that loses the majority stops committing writes
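Whichever mechanism is chosen, the ordering is what matters: fence first, promote second, and refuse to promote if fencing cannot be confirmed. The sketch below shows only that ordering. `failover`, `FencingFailed`, and the stand-in callables are hypothetical names, and a real fencing step would call IPMI, move a VIP, or similar.

```python
from typing import Callable


class FencingFailed(Exception):
    """Raised when the old primary could not be proven isolated."""


def failover(
    fence_old_primary: Callable[[], bool],
    promote_replica: Callable[[], None],
) -> None:
    """Promote only after the fencing step reports success."""
    if not fence_old_primary():
        # Refusing to promote leaves the system without a writable primary,
        # which is recoverable. Two writable primaries may not be.
        raise FencingFailed("old primary not confirmed isolated; promotion aborted")
    promote_replica()


# Stand-in callables that only record what happened.
events: list[str] = []


def fake_fence_unreachable() -> bool:
    events.append("fence attempted")
    return False  # e.g. the management interface was unreachable too


def fake_promote() -> None:
    events.append("replica promoted")


try:
    failover(fake_fence_unreachable, fake_promote)
except FencingFailed:
    events.append("promotion aborted")

print(events)  # ['fence attempted', 'promotion aborted']
```

During a partition the fencing call is likely to fail for the same reason the health check did. Under this design that failure blocks the promotion instead of being ignored.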

GitHub, after the incident, implemented several of these improvements. The post-mortem specifically mentions adding fencing checks and revising Orchestrator timeouts to be more conservative in maintenance environments.

There is also a lesson about **multi-datacenter topology**: in an active-active or active-passive configuration with replicas in multiple regions, the failover system needs to be **topology-aware**. It should not promote a replica in a different region without considering replication lag and the risk of divergence. With asynchronous replication, a replica in US West typically trails the primary in US East by at least the cross-country network latency. Promoting it during a transient partition means that any writes the original primary committed but had not yet shipped are missing from the new primary, and they will later surface as lost or conflicting data.
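One way to picture topology-aware promotion is a candidate filter that prefers replicas in the primary's own region and treats cross-region promotion as an explicit opt-in. This is a sketch under stated assumptions: the `Replica` fields, the one-second lag threshold, and the `allow_cross_region` flag are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Replica:
    name: str
    region: str
    lag_seconds: float


def pick_candidate(
    replicas: list[Replica],
    primary_region: str,
    max_lag_seconds: float = 1.0,
    allow_cross_region: bool = False,
) -> Optional[Replica]:
    """Pick the least-lagged eligible replica, preferring the primary's region."""
    fresh = [r for r in replicas if r.lag_seconds <= max_lag_seconds]
    local = [r for r in fresh if r.region == primary_region]
    pool = local or (fresh if allow_cross_region else [])
    return min(pool, key=lambda r: r.lag_seconds) if pool else None


# Only a US West replica is visible to the failover system.
visible = [Replica("west-r1", "us-west", 0.05)]

print(pick_candidate(visible, primary_region="us-east"))
# None: no automatic cross-region promotion
print(pick_candidate(visible, primary_region="us-east", allow_cross_region=True))
# Replica(name='west-r1', region='us-west', lag_seconds=0.05)
```

Returning `None` here means "escalate to a human", which trades a slower recovery for a lower chance of promoting the wrong node.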

## Technical Lessons from the Incident

- **Failover automation without fencing is a double-edged sword**: it accelerates recovery in real failures, but can create split-brain during transient partitions. The failure detection timeout should be larger than the expected duration of maintenance partitions.
- **Fencing is non-negotiable in database systems**: before promoting a new primary, the system must ensure — not assume — that the old primary is isolated. STONITH, VIP revocation, or explicit quorum are acceptable mechanisms.
- **Maintenance windows must be coordinated with automation systems**: Orchestrator should have been placed in maintenance mode (no automatic promotions) during the equipment swap window. Automation that doesn't respect operational context is dangerous automation.
- **MySQL split-brain recovery is expensive and manual**: the absence of native conflict resolution in MySQL binlog replication means any split-brain requires manual intervention from experienced engineers, binlog analysis, and decisions about which writes to discard.
- **Cache layers amplify data inconsistencies**: Memcached serving data from an inconsistent MySQL multiplied the split-brain impact for end users. Cache invalidation must be part of the split-brain recovery playbook.
- **Multi-datacenter topologies require topology-aware failover**: promoting a cross-datacenter replica during a transient partition is particularly risky due to inherent replication lag. The failover system should treat cross-region promotions with more conservative criteria.
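The first and third lessons can be combined into one small guard: automatic promotion is allowed only outside a declared maintenance window and only after the primary has been unreachable for longer than a configured timeout. The 43 seconds comes from the incident. The `FailoverPolicy` type and the 30-second and 120-second timeouts are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverPolicy:
    detection_timeout_s: float        # how long the primary must be unreachable
    maintenance_active: bool = False  # set by operators before planned work


def should_auto_failover(unreachable_for_s: float, policy: FailoverPolicy) -> bool:
    """Allow automatic promotion only outside maintenance and after the timeout."""
    if policy.maintenance_active:
        return False
    return unreachable_for_s >= policy.detection_timeout_s


partition_s = 43  # duration of the partition in this incident

print(should_auto_failover(partition_s, FailoverPolicy(detection_timeout_s=30)))   # True
print(should_auto_failover(partition_s, FailoverPolicy(detection_timeout_s=120)))  # False
print(should_auto_failover(partition_s, FailoverPolicy(30, maintenance_active=True)))  # False
```

A longer timeout is not free, because it also delays recovery when the primary has genuinely died. The maintenance flag avoids that trade-off for planned work.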

> **My Perspective: The Real Problem is Blind Trust in Automation:** After 16 years working with systems that cannot fail (financial platforms, payment systems, critical data infrastructure), what strikes me about this incident is not what broke, but **what was assumed**.
>
> Orchestrator was configured to act quickly. Quickly enough to outrun the duration of a planned maintenance partition. That's a calibration error, not a product error. The question every team must ask when configuring failover automation is: **what is the cost of a false positive versus a false negative?** In database systems, a false positive (promoting when the primary is still alive) is catastrophically more expensive than a false negative (not promoting when the primary actually died). You can recover from a dead primary with downtime. You cannot recover from a split-brain without potential data loss.
>
> If I were designing this architecture today, I would implement three layers of protection: (1) **explicit maintenance mode in Orchestrator** that must be activated before any network maintenance window, with enforcement via runbook and pre-checklist automation; (2) **semi-synchronous replication** to ensure the primary doesn't acknowledge writes that haven't been received by at least one replica, reducing the divergence window; and (3) **fencing via VIP revocation** before any promotion, ensuring the old primary loses the write address before the new one takes over.
>
> What concerns me most about this incident is what it reveals about **the illusion of control that automation creates**.

## Verdict: Automation Speed vs. Consistency Safety

GitHub's October 2018 incident is a definitive case study on the **CAP theorem applied to real operations**: when you have a network partition, you must choose between availability and consistency. Orchestrator chose availability — it promoted a replica to keep the write service active. The price was 24 hours of data inconsistency.

The central lesson is not that failover automation is bad. It's that **failover automation without fencing is incomplete**. A failover system that cannot guarantee the old primary is dead before promoting the new one is not a high availability system — it's a high split-brain probability system.

The changes GitHub implemented after the incident — more conservative timeouts, fencing checks, better coordination between maintenance windows and automation — are the correct path. But the deeper lesson is cultural: **teams operating distributed databases need a clear mental model of how their failover systems behave during transient partitions**, not just total failures.

## References

- [GitHub — October 21 Post-Incident Analysis (Official)](https://github.blog/2018-10-30-oct21-post-incident-analysis/)
- [GitHub Orchestrator — MySQL Topology Management](https://github.com/openark/orchestrator)
- [MySQL Semi-Synchronous Replication Documentation](https://dev.mysql.com/doc/refman/8.0/en/replication-semisync.html)
- [STONITH / Fencing in High Availability Clusters](https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch09.html)
- [Designing Data-Intensive Applications — Martin Kleppmann (Ch. 5: Replication, Split-Brain)](https://dataintensive.net/)
