
GitHub 2018: 43 Seconds of Partition, 24 Hours of MySQL Split-Brain

Oct 21, 2018 10 min AI-assisted

On October 21, 2018, a 43-second network maintenance window triggered a failure cascade that kept GitHub degraded for nearly 24 hours. Orchestrator promoted replicas in the wrong regions, creating a split-brain between datacenters — revealing how failover automation without proper fencing can turn a micro-failure into a data consistency catastrophe.

Incident Fact Sheet

Company / System: GitHub, a code hosting and collaboration platform
Incident Date: October 21, 2018
Total Degradation Duration: 24 hours and 11 minutes, as reported in GitHub's post-incident analysis
Initial Network Partition: 43 seconds during network equipment maintenance
Impact: Data inconsistency across regions, degraded reads and writes, multiple internal services affected (Issues, Pull Requests, notifications, webhooks)
Stack Involved: MySQL (primary-replica topology), GitHub Orchestrator, ProxySQL, Memcached, GitHub Actions (predecessor), US East + US West datacenters
Root Cause: Automatic replica promotion by Orchestrator during a transient network partition, without fencing the original primary, resulting in two active primaries simultaneously

Forty-three seconds. That was how long GitHub's network was partitioned during a planned maintenance window. There was no catastrophic hardware failure, no application bug, no attack. Just 43 seconds of silence between datacenters — enough time for the failover automation system to make an irreversible decision that would take nearly a full day to correct. This post-mortem analyzes how the interaction between well-intentioned automation, MySQL replication topology, and the absence of fencing mechanisms turned a micro-interruption into one of the most instructive incidents in recent platform engineering history.

What Happened

On October 21, 2018, at 16:00 UTC, GitHub's network engineering team initiated the replacement of a network device in the primary datacenter located on the US East Coast. The procedure was routine — the kind of maintenance that happens dozens of times a year in large-scale infrastructure. During the swap, there was a 43-second connectivity interruption between the US East and US West datacenters.

This interval was sufficient for Orchestrator, the open source MySQL topology management tool that GitHub uses and helps develop, to interpret the loss of connectivity to the MySQL primary in US East as a real node failure. Following its automatic promotion logic, Orchestrator elected a replica in US West as the new primary and redirected write traffic to it.

The problem: the original primary in US East had not failed. It was perfectly operational — just temporarily isolated. When connectivity was restored after 43 seconds, the system found itself in a classic split-brain state: two MySQL nodes believing themselves to be the legitimate primary, both accepting writes, both silently diverging in terms of data.

From that moment on, every write arriving at the US East primary and every write arriving at the new US West primary created conflicting versions of reality. Issues created on one side didn't appear on the other. Pull requests updated in one region diverged from the other. Memcached, which serves as a caching layer on top of MySQL, began serving inconsistent data — because the underlying data was inconsistent.

GitHub's team quickly detected that something was wrong — replication alerts fired within minutes. But the fix was not trivial. You can't simply "reconnect" two diverging primaries and expect MySQL to resolve conflicts. MySQL's binlog-based replication is unidirectional and has no native conflict resolution. To restore consistency, the team needed to identify the exact point of divergence in the binlogs, discard writes from the side being demoted, and rebuild the replication topology — all while the production site was still serving traffic.
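To make the "point of divergence" idea concrete, here is a minimal sketch. It assumes each side's history can be reduced to an ordered list of transaction records; the `Txn` type, the identifiers, and the statements are hypothetical stand-ins, not GitHub's tooling or the real binlog format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn:
    """Simplified transaction record; real binlog events carry far more detail."""
    txn_id: str      # hypothetical GTID-like identifier
    statement: str


def split_at_divergence(east: list[Txn], west: list[Txn]):
    """Return (shared_prefix, east_only, west_only) for two ordered histories."""
    shared = 0
    for a, b in zip(east, west):
        if a != b:
            break
        shared += 1
    return east[:shared], east[shared:], west[shared:]


# Both sides agree up to the partition, then each accepts its own writes.
common = [Txn("src:1", "INSERT issue 101"), Txn("src:2", "UPDATE pr 7")]
east_log = common + [Txn("east:3", "INSERT comment 55")]
west_log = common + [Txn("west:3", "INSERT issue 102"), Txn("west:4", "UPDATE pr 7")]

prefix, east_only, west_only = split_at_divergence(east_log, west_log)
print(len(prefix), [t.txn_id for t in east_only], [t.txn_id for t in west_only])
```

Everything after the shared prefix exists on only one side. Those two suffixes are what engineers have to reconcile or discard by hand, because replication will not merge them.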

Incident Timeline

  1. 16:00 UTC: Maintenance Begins

    The network team begins equipment replacement in the US East datacenter. The procedure is planned and approved.

  2. 16:00–16:01 UTC: 43-Second Partition

    Connectivity between US East and US West is interrupted for 43 seconds. The MySQL primary in US East becomes unreachable from US West.

  3. ~16:01 UTC: Orchestrator Promotes Replica

    Orchestrator, operating from US West, detects a primary failure and automatically promotes the most advanced replica in US West as the new primary. ProxySQL is reconfigured to direct writes to the new primary.

  4. ~16:01 UTC: Connectivity Restored, Split-Brain Active

    The network is restored. There are now two active MySQL primaries: the original in US East (which never stopped) and the new one in US West. Both accept writes. Data divergence begins.

  5. ~16:05 UTC: Alerts Fire

    MySQL replication alerts and inconsistencies in application metrics begin to surface. Engineers are paged. The incident is declared.

  6. 16:00–21:00 UTC: Assessment and Containment

    Teams assess the extent of divergence. Write traffic is gradually redirected. The decision is made to use US East as the source of truth and discard conflicting writes from US West. Binlog analysis is used to identify the point of divergence.

  7. 21:00 UTC: Controlled Recovery Begins

    The resynchronization process begins. Replicas are rebuilt from the US East primary. Services are gradually restored as consistency is verified cluster by cluster.

  8. ~15:00 UTC (Oct 22): Full Recovery

    All MySQL systems return to consistency. The replication topology is fully restored. Monitoring confirms no additional divergences. The incident is closed.

Failure Flow: Cross-Datacenter Split-Brain

The diagram reconstructs the system state during the split-brain. The 43s partition caused Orchestrator (in US West) to promote a local replica, while the original primary in US East remained operational — resulting in two primaries simultaneously accepting writes.

🌐 Clients
  • GitHub Users · Web + Git
🔀 Routing Layer
  • ProxySQL · US East
  • ProxySQL · US West
🗄️ US East: Primary Datacenter
  • MySQL Primary · US East ✅ (never failed)
  • MySQL Replica · US East R1
  • MySQL Replica · US East R2
🗄️ US West: Secondary Datacenter
  • MySQL Promoted · US West ⚠️ (new primary, promoted in error)
  • MySQL Replica · US West R1
  • Orchestrator · US West
⚡ Cache Layer
  • Memcached · (inconsistent data)

Root Cause: Failover Automation Without Fencing

The root cause was not the network partition itself — 43-second partitions are expected in large-scale distributed systems. The root cause was the absence of a fencing mechanism (STONITH — Shoot The Other Node In The Head) prior to automatic promotion. Orchestrator promoted a replica without first ensuring the original primary was truly dead and incapable of accepting writes. In database systems, the rule is absolute: you never promote a new primary without being certain the old one has been isolated. Automation that acts faster than the time needed to confirm the previous node's death is automation that can create split-brain. Failover speed and consistency safety are opposing forces — and in this case, speed won destructively.

The Complexity of Remediation

When GitHub's team confirmed the split-brain, the first critical decision was: which side of the divergence to preserve? This is not a trivial question. In a system with millions of active users, both sides of the split-brain had legitimate writes from real users. Choosing US East as the source of truth meant discarding writes that had happened in US West after the promotion — writes that users had made in good faith and expected to be persisted.

The team decided to use US East as the source of truth, based on the fact that it was the original primary and had the longest and most reliable replication history. Conflicting writes from US West were identified through binlog analysis — a manual and meticulous process of comparing transaction logs from both sides to find the exact point where the histories diverged.

The recovery process was deliberately slow and cautious. The team couldn't simply run CHANGE MASTER TO and hope for the best. Each MySQL cluster needed to be verified individually. Memcached needed to be completely invalidated to prevent stale data from continuing to be served after resynchronization. Services that depended on eventual consistency needed to be paused or placed in read-only mode while the topology was rebuilt.

An important detail the official post-mortem reveals: during the recovery process, the team disabled Orchestrator to prevent it from making further automatic decisions while the topology was in an inconsistent state. This is an implicit acknowledgment that automation, at that moment, was an additional risk, not an aid. The recovery was essentially manual — experienced engineers navigating MySQL binlogs at 2 AM, making surgical decisions about which transactions to preserve.

The total cost was approximately 24 hours of service degradation, with varying impact across different features. Some features were completely unavailable; others operated in degraded mode with the possibility of inconsistency. GitHub's public communication during the incident was exemplary — frequent updates on the status page, transparency about the nature of the problem, and a detailed post-mortem published 9 days later.

Why Orchestrator Acted This Way — and Why It Makes Sense

It's important not to demonize Orchestrator. It did exactly what it was configured to do: detect primary unavailability and promote the best available replica. In the vast majority of real failure scenarios (the primary actually died, a disk failed, a process hung), this behavior is correct and spares operators a slow manual recovery.

The problem is that Orchestrator, like any automatic failover system, operates under an implicit assumption: if it can't reach the primary, the primary is dead. This assumption is reasonable but not universal. Transient network partitions — especially during planned maintenance — create exactly the scenario where the assumption fails.

The technical solution to this problem has been known for decades in distributed systems literature: quorum-based fencing. Before promoting a replica, the failover system must obtain confirmation from a majority of nodes (or an external arbiter) that the primary is truly unreachable — and, ideally, must execute a fencing mechanism to ensure the old primary cannot accept writes even if it comes back online.
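The quorum-then-fence ordering described above can be sketched in a few lines. This is an illustrative model, not Orchestrator's implementation: the observer names, the `fence_old_primary` callback, and the `promote_replica` callback are all hypothetical, and a real system would also need the observers to agree among themselves through a consensus protocol.

```python
from typing import Callable


def primary_confirmed_down(observer_votes: dict[str, bool]) -> bool:
    """True only if a strict majority of observers report the primary unreachable."""
    unreachable = sum(1 for down in observer_votes.values() if down)
    return unreachable > len(observer_votes) // 2


def attempt_failover(
    observer_votes: dict[str, bool],
    fence_old_primary: Callable[[], bool],
    promote_replica: Callable[[], None],
) -> str:
    """Promote only after quorum agrees AND the old primary is fenced."""
    if not primary_confirmed_down(observer_votes):
        return "no-quorum: primary kept"
    if not fence_old_primary():
        return "fencing-failed: promotion refused"
    promote_replica()
    return "promoted"


promoted: list[str] = []


def promote() -> None:
    promoted.append("us-west-replica")


# Only the US West observer lost sight of the primary: no majority, no promotion.
partition_votes = {"us-west-1": True, "us-east-1": False, "us-east-2": False}
print(attempt_failover(partition_votes, lambda: True, promote))

# A majority agrees, but the old primary cannot be fenced: still no promotion.
outage_votes = {"us-west-1": True, "us-west-2": True, "us-east-1": False}
print(attempt_failover(outage_votes, lambda: False, promote))

# A majority agrees and fencing succeeds: promotion proceeds.
print(attempt_failover(outage_votes, lambda: True, promote))
```

The design choice worth noticing is that a failed fencing attempt blocks promotion. The system accepts longer downtime rather than risk two writable primaries.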

In the MySQL context, this can be implemented in several ways:

  • STONITH via IPMI/iDRAC: physically power off the node via management interface before promotion
  • VIP (Virtual IP) revocation: revoke the primary's virtual IP before assigning it to the new one
  • Semi-synchronous replication with rpl_semi_sync_master_wait_for_slave_count: ensure writes are only acknowledged when at least N replicas have received them
  • MySQL Group Replication or InnoDB Cluster: topologies with native group consensus based on a Paxos variant
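Of the options above, the semi-synchronous one is the easiest to model. The toy class below is a sketch under a strong assumption: the primary never falls back to asynchronous mode, whereas real MySQL does so after `rpl_semi_sync_master_timeout` expires. The class, replica names, and statements are hypothetical.

```python
class SemiSyncPrimary:
    """Toy model: a write is acknowledged only if enough replicas can receive it."""

    def __init__(self, required_acks: int):
        self.required_acks = required_acks
        self.reachable_replicas: set[str] = set()
        self.committed: list[str] = []

    def write(self, statement: str) -> bool:
        if len(self.reachable_replicas) < self.required_acks:
            # The client sees the write stall or fail instead of silently diverging.
            return False
        self.committed.append(statement)
        return True


primary = SemiSyncPrimary(required_acks=1)

primary.reachable_replicas = {"east-r1", "west-r1"}
ok_before = primary.write("INSERT issue 101")

primary.reachable_replicas = set()  # primary fully isolated from its replicas
ok_isolated = primary.write("INSERT issue 102")

print(ok_before, ok_isolated, primary.committed)
```

Note the limit of this protection. If a local replica such as `east-r1` stays reachable during the partition, the isolated primary keeps acknowledging writes, so semi-sync narrows the divergence window but does not replace fencing.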

GitHub, after the incident, implemented several of these improvements. The post-mortem specifically mentions adding fencing checks and revising Orchestrator timeouts to be more conservative in maintenance environments.

There is also a lesson about multi-datacenter topology: in an active-active or active-passive configuration with replicas in multiple regions, the failover system needs to be topology-aware. It should not promote a replica in a different region without considering replication lag and the risk of divergence. Under asynchronous replication, a replica in US West is typically at least a few milliseconds behind the primary in US East. Promoting it during a transient partition means that any writes the original primary committed but had not yet replicated will be lost or will conflict.
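One way to express "topology-aware" is as a candidate-selection rule. The sketch below is an assumption about how such a rule could look, not a description of Orchestrator's promotion logic; the `Replica` type, region names, and lag values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Replica:
    name: str
    region: str
    lag_seconds: float


def pick_candidate(
    reachable: list[Replica],
    primary_region: str,
    max_cross_region_lag: float = 0.0,
) -> Optional[Replica]:
    """Prefer same-region replicas; allow cross-region only under a strict lag bound."""
    local = [r for r in reachable if r.region == primary_region]
    if local:
        return min(local, key=lambda r: r.lag_seconds)
    remote = [r for r in reachable if r.lag_seconds <= max_cross_region_lag]
    if remote:
        return min(remote, key=lambda r: r.lag_seconds)
    return None  # no safe candidate: escalate to a human instead of promoting


west_only = [Replica("west-r1", "us-west", 0.05)]

# During the partition only US West replicas are visible, and they lag: refuse.
print(pick_candidate(west_only, primary_region="us-east"))

# If a same-region replica is reachable, it wins even with more lag.
with_east = west_only + [Replica("east-r1", "us-east", 0.2)]
print(pick_candidate(with_east, primary_region="us-east").name)
```

Returning `None` is deliberate. When the only candidates are lagging replicas in another region, the rule hands the decision to an operator instead of promoting automatically.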

Technical Lessons from the Incident

  • Failover automation without fencing is a double-edged sword: it accelerates recovery in real failures, but can create split-brain during transient partitions. The failure detection timeout should be larger than the expected duration of maintenance partitions.
  • Fencing is non-negotiable in database systems: before promoting a new primary, the system must ensure, not assume, that the old primary is isolated. STONITH, VIP revocation, or explicit quorum are acceptable mechanisms.
  • Maintenance windows must be coordinated with automation systems: Orchestrator should have been placed in maintenance mode (no automatic promotions) during the equipment swap window. Automation that doesn't respect operational context is dangerous automation.
  • MySQL split-brain recovery is expensive and manual: the absence of native conflict resolution in MySQL binlog replication means any split-brain requires manual intervention from experienced engineers, binlog analysis, and decisions about which writes to discard.
  • Cache layers amplify data inconsistencies: Memcached serving data from an inconsistent MySQL multiplied the split-brain impact for end users. Cache invalidation must be part of the split-brain recovery playbook.
  • Multi-datacenter topologies require topology-aware failover: promoting a cross-datacenter replica during a transient partition is particularly risky due to inherent replication lag. The failover system should treat cross-region promotions with more conservative criteria.
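The first and third lessons, detection timeouts and maintenance mode, combine into a very small decision rule. The sketch below uses hypothetical threshold values; the source does not state what timeout Orchestrator was actually configured with.

```python
from dataclasses import dataclass


@dataclass
class FailureDetector:
    """Decides whether an unreachable primary should trigger automatic failover."""
    threshold_seconds: float        # hypothetical detection timeout
    maintenance_mode: bool = False  # set before any planned network work

    def should_failover(self, unreachable_seconds: float) -> bool:
        if self.maintenance_mode:
            return False  # automation stands down; humans decide
        return unreachable_seconds >= self.threshold_seconds


PARTITION_SECONDS = 43  # duration of the partition in this incident

aggressive = FailureDetector(threshold_seconds=30)
conservative = FailureDetector(threshold_seconds=120)
in_maintenance = FailureDetector(threshold_seconds=30, maintenance_mode=True)

print(aggressive.should_failover(PARTITION_SECONDS))      # True
print(conservative.should_failover(PARTITION_SECONDS))    # False
print(in_maintenance.should_failover(PARTITION_SECONDS))  # False
```

Either safeguard alone would have ridden out a 43-second partition in this model. The maintenance flag is the more robust of the two because it does not depend on guessing how long a planned interruption will last.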
My Perspective: The Real Problem is Blind Trust in Automation
Senior Solutions Architect

After 16 years working with systems that cannot fail (financial platforms, payment systems, critical data infrastructure), what strikes me about this incident is not what broke, but what was assumed. Orchestrator was configured to act quickly. Quickly enough to outrun the duration of a planned maintenance partition. That's a calibration error, not a product error.

The question every team must ask when configuring failover automation is: what is the cost of a false positive versus a false negative? In database systems, a false positive (promoting when the primary is still alive) is catastrophically more expensive than a false negative (not promoting when the primary actually died). You can recover from a dead primary with downtime. You cannot recover from a split-brain without potential data loss.

If I were designing this architecture today, I would implement three layers of protection: (1) explicit maintenance mode in Orchestrator that must be activated before any network maintenance window, with enforcement via runbook and pre-checklist automation; (2) semi-synchronous replication to ensure the primary doesn't acknowledge writes that haven't been received by at least one replica, reducing the divergence window; and (3) fencing via VIP revocation before any promotion, ensuring the old primary loses the write address before the new one takes over.

What concerns me most about this incident is what it reveals about the illusion of control that automation creates. Teams that implement Orchestrator (or Patroni, or MHA, or any other automatic failover

Verdict: Automation Speed vs. Consistency Safety

GitHub's October 2018 incident is a definitive case study on the CAP theorem applied to real operations: when you have a network partition, you must choose between availability and consistency. Orchestrator chose availability — it promoted a replica to keep the write service active. The price was 24 hours of data inconsistency. The central lesson is not that failover automation is bad. It's that failover automation without fencing is incomplete. A failover system that cannot guarantee the old primary is dead before promoting the new one is not a high availability system — it's a high split-brain probability system. The changes GitHub implemented after the incident — more conservative timeouts, fencing checks, better coordination between maintenance windows and automation — are the correct path. But the deeper lesson is cultural: teams operating distributed databases need a clear mental model of how their failover systems behave during transient partitions, not just total failures. That knowledge doesn't come from documentation — it comes from deliberate chaos engineering, from gamedays, from simulating exactly the scenario that happened on October 21, 2018 in a contro

Written with AI assistance from the public case and my architect's reading.