Meta 2021: How a Maintenance Command Took Facebook Down for 6 Hours
On October 4, 2021, a misconfigured maintenance command withdrew all BGP routes from Meta's backbone, making authoritative DNS servers unreachable from the internet. The cascade effect took down Facebook, Instagram, WhatsApp, and Oculus simultaneously for approximately six hours, affecting billions of users and exposing critical weaknesses in access control systems and emergency recovery procedures.
Incident Facts
- Company / System: Meta (Facebook, Instagram, WhatsApp, Oculus)
- Date: October 4, 2021
- Duration: ~6 hours (approximately 15:51 UTC to ~21:45 UTC)
- User impact: ~3.5 billion users unable to access any Meta service
- Financial impact (estimated): ~US$ 60–100 million in lost ad revenue (market estimate)
- Affected services: Facebook, Instagram, WhatsApp, Oculus, Workplace, Meta internal systems
- Root cause: Accidental withdrawal of BGP routes from backbone during network capacity audit
- Stack / Protocols: BGP (Border Gateway Protocol), internal authoritative DNS, FBTUN (proprietary backbone), internal network management tooling
- Failure type: Human error amplified by absent safeguards and excessive coupling between control plane and data plane
On October 4, 2021, a Meta engineer executed a routine maintenance command to audit backbone network capacity. The command worked exactly as designed — and that was precisely the problem. Within minutes, all BGP routes advertising Meta's IP prefixes to the internet were withdrawn. The company's authoritative DNS servers, which reside within that same infrastructure, became unreachable. Facebook, Instagram, WhatsApp, and Oculus vanished from the internet simultaneously. What followed was one of the largest service outages in modern internet history — and a masterclass in how poorly designed control systems can turn a trivial task into a global catastrophe.
What Happened: From Audit to Collapse
To understand the incident, one must first understand how Meta structures its global network. The company operates one of the largest private networks in the world, with points of presence (PoPs) in dozens of countries connected by a proprietary backbone. This backbone is advertised to the internet via BGP — the inter-autonomous-system routing protocol that, simply put, tells the rest of the internet "these IP prefixes belong to Meta, send traffic here."
On the morning of October 4 (Pacific time), an engineering team was performing maintenance work on the backbone infrastructure to audit available capacity. They used an internal network management tool that, among other functions, allows modifying router configurations across the network spine. The issued command was intended to assess available backbone capacity. According to Meta's public engineering write-up, the command instead took down all backbone connections, and a bug in the audit tooling that should have blocked such a command failed to stop it. The net effect was that BGP sessions were disabled on all backbone routers simultaneously.
The result was immediate and total: all of Meta's IP prefixes were withdrawn from global BGP. From the internet's perspective, Meta simply ceased to exist. No DNS resolver in the world could reach Meta's authoritative name servers to resolve facebook.com, instagram.com, or whatsapp.com — because those servers sat behind the same infrastructure that had just been disconnected. DNS itself wasn't broken; it was simply unreachable.
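To make "DNS wasn't broken, it was unreachable" concrete, here is a minimal sketch of a routing table with longest-prefix matching. It is a toy model, not real BGP: the `RoutingTable` class, the documentation prefix `192.0.2.0/24`, the documentation AS number `64500`, and the name server address are all assumptions chosen for illustration.

```python
import ipaddress


class RoutingTable:
    """Toy stand-in for the global BGP table: prefix -> origin AS number."""

    def __init__(self):
        self.routes = {}

    def announce(self, prefix, origin_as):
        self.routes[ipaddress.ip_network(prefix)] = origin_as

    def withdraw(self, prefix):
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Longest-prefix match. Returns the origin AS, or None if no route exists."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        return self.routes[max(matches, key=lambda net: net.prefixlen)]


# Hypothetical values: documentation prefix and documentation AS number.
table = RoutingTable()
table.announce("192.0.2.0/24", 64500)
NAME_SERVER = "192.0.2.53"  # assumed address of an authoritative name server

before = table.lookup(NAME_SERVER)  # a route exists, packets can be forwarded
table.withdraw("192.0.2.0/24")
after = table.lookup(NAME_SERVER)   # no route: the server is up but unreachable
print(before, after)
```

The name server process never changes state in this model. Only the route to it disappears, which is the distinction the paragraph above draws.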
What made the situation even more critical was the collateral effect on Meta's own internal systems. Monitoring tools, remote access systems, and even physical data center badge readers depended on services that also went offline. Engineers attempting to diagnose the problem remotely lost access to the very tools needed to fix it. The team had to dispatch technicians physically to the data centers — a process that took hours, especially since physical access control systems were also partially affected.
Incident Timeline
1. ~15:51 UTC: Maintenance command issued. Engineer executes internal network management tool for backbone capacity audit. The command, due to a bug or misconfiguration, instructs routers to disable BGP sessions across the entire spine.
2. ~15:51–15:53 UTC: Cascading BGP route withdrawal. Within approximately two minutes, all Meta IP prefixes are withdrawn from global BGP. Routers worldwide stop forwarding traffic to Meta infrastructure. DNS entry TTLs begin expiring in recursive resolvers.
3. ~15:53 UTC: DNS fails globally. Recursive DNS resolvers worldwide attempt to query Meta's authoritative name servers (a.ns.facebook.com, b.ns.facebook.com, etc.) and receive timeouts. facebook.com, instagram.com, and whatsapp.com become unresolvable. Users begin reporting failures.
4. ~16:00 UTC: Internal alerts fire; remote access compromised. Meta's monitoring systems detect the outage, but incident response tools themselves depend on services that are also offline. Remote engineers lose access to diagnostic and management tools. Internal communication via internet-based systems is also affected.
5. ~16:30–18:00 UTC: Physical access to data centers. Teams are physically dispatched to data centers. The process is slow: physical access control systems (badge readers) partially depend on affected services. Technicians require manual emergency authorizations to enter some facilities.
6. ~18:00–20:00 UTC: Diagnosis and start of recovery. With physical access established, engineers identify the root cause: disabled BGP sessions on backbone routers. They begin the manual process of re-enabling BGP sessions. Recovery is deliberately gradual to avoid sudden load surge on systems coming back online.
7. ~20:00–21:45 UTC: Gradual service restoration. BGP routes are progressively re-added. Authoritative DNS servers become reachable again. Recursive resolvers worldwide begin resolving Meta's domains again. Services are restored in a staggered fashion to prevent thundering herd.
8. ~21:45 UTC: Services fully restored. All Meta services report normal operation. Total incident duration: approximately 6 hours.
Failure Flow: BGP, DNS, and the Cascading Collapse
The diagram reconstructs the failure flow: how the withdrawal of BGP routes from the backbone made authoritative DNS servers unreachable, which in turn made all Meta services unresolvable to the entire world.
[Diagram components: end user (browser / app); ISP recursive DNS resolver; global BGP routing table; edge routers (BGP peers); backbone routers (BGP sessions disabled); network management tool (maintenance command); engineer (issued command); authoritative DNS (a/b.ns.facebook.com, down); Facebook, Instagram, and WhatsApp app servers (down); internal tools and monitoring (down)]
Root Cause: Fatal Coupling Between Control Plane and Data Plane
The root cause was not just the incorrect command — it was a systemic design failure across three overlapping layers:
1. Absence of dry-run and blast radius validation: The network management tool allowed issuing commands affecting all backbone routers without any prior simulation mechanism, staged confirmation, or scope limiting. A command that should have been surgical could be — and was — global.
2. Authoritative DNS inside the affected perimeter: Meta's authoritative name servers (a.ns.facebook.com, b.ns.facebook.com) were advertised via the same BGP prefixes that were withdrawn. This created a fatal circular dependency: to reach any Meta service, the world needed to resolve DNS; to resolve DNS, it needed to reach the authoritative servers; to reach the authoritative servers, it needed the BGP routes that had just disappeared.
3. Recovery control plane dependent on the compromised data plane: The tools engineers would need to diagnose and fix the problem — monitoring systems, remote access, internal communication — depended on the same infrastructure that was offline. The recovery system was coupled to the system that failed.
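The first layer, missing dry-run and blast radius validation, can be sketched as a guard that every change must pass before it touches a device. This is a hypothetical design, not Meta's tooling: `ChangePlan`, `run_change`, `max_fraction`, and the device names are all invented for illustration.

```python
from dataclasses import dataclass


class BlastRadiusExceeded(Exception):
    """Raised when a change would touch more devices than the caller allowed."""


@dataclass(frozen=True)
class ChangePlan:
    command: str
    targets: tuple  # device names the command would touch


def run_change(plan, inventory, max_fraction, dry_run=True, apply=None):
    """Validate scope first, then either describe or apply the change."""
    unknown = set(plan.targets) - set(inventory)
    if unknown:
        raise ValueError(f"unknown devices: {sorted(unknown)}")

    fraction = len(plan.targets) / len(inventory)
    if fraction > max_fraction:
        raise BlastRadiusExceeded(
            f"{len(plan.targets)}/{len(inventory)} devices exceeds limit {max_fraction:.0%}"
        )

    if dry_run:  # default: report what would happen, change nothing
        return [f"would run {plan.command!r} on {device}" for device in plan.targets]

    if apply is None:
        raise ValueError("apply callback is required when dry_run is False")
    for device in plan.targets:
        apply(device, plan.command)
    return [f"ran {plan.command!r} on {device}" for device in plan.targets]


inventory = [f"bb-router-{i:02d}" for i in range(20)]  # hypothetical device names
narrow = ChangePlan("show capacity", ("bb-router-03",))
everything = ChangePlan("show capacity", tuple(inventory))

print(run_change(narrow, inventory, max_fraction=0.10))
try:
    run_change(everything, inventory, max_fraction=0.10)
except BlastRadiusExceeded as exc:
    print("refused:", exc)
```

The design choice worth noting is that the scope check runs before the dry-run branch, so even a simulation of a network-wide change is refused unless the caller explicitly raises the limit.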
The Authoritative DNS Trap: An Invisible Circular Dependency
One of the most technically interesting aspects of this incident is the specific role of DNS — and why it could not recover automatically even when parts of the infrastructure might have been restored more quickly.
DNS works in layers. When a recursive resolver (like Google's 8.8.8.8 or your ISP's resolver) needs to resolve facebook.com, it first queries root servers to find who is authoritative for .com, then queries TLD servers to find who is authoritative for facebook.com, and finally queries Meta's name servers directly. Those name servers (a.ns.facebook.com, b.ns.facebook.com, etc.) are the ones that return the actual IP addresses of application servers.
The problem is that Meta's authoritative name servers are IP addresses within Meta's own prefixes — the same ones withdrawn from BGP. When a resolver tried to reach them, packets simply had no route. This wasn't a DNS failure in the sense of "the server responded with an error" — it was a network reachability failure. The server was there, running, but no packets could reach it.
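The difference between "the server answered with an error" and "no packet reached the server" can be shown with a small simulation of the three resolution steps described above. The zone data is hypothetical (an `example.com.` domain with documentation addresses), and `resolve` is a simplified sketch that ignores caching, retries, and multiple name servers.

```python
# Hypothetical delegation data using documentation address ranges.
ROOT_ZONE = {"com.": "198.51.100.1"}          # root: who serves .com
TLD_ZONE = {"example.com.": "192.0.2.53"}     # TLD: who is authoritative for the domain
AUTH_ZONE = {"www.example.com.": "192.0.2.80"}  # authoritative: the actual records


def resolve(name, reachable):
    """Walk root -> TLD -> authoritative. `reachable` is the set of IPs with a route."""
    labels = name.rstrip(".").split(".")
    tld = labels[-1] + "."
    domain = ".".join(labels[-2:]) + "."

    _tld_server = ROOT_ZONE[tld]  # step 1: root refers us to the TLD server
    ns_ip = TLD_ZONE[domain]      # step 2: TLD refers us to the authoritative server

    # Step 3: the query needs a network path to the authoritative server.
    if ns_ip not in reachable:
        raise TimeoutError(f"no response from {ns_ip}")

    if name in AUTH_ZONE:
        return ("NOERROR", AUTH_ZONE[name])
    return ("NXDOMAIN", None)  # the server answered: the name does not exist


print(resolve("www.example.com.", reachable={"192.0.2.53"}))
print(resolve("missing.example.com.", reachable={"192.0.2.53"}))
try:
    resolve("www.example.com.", reachable=set())
except TimeoutError as exc:
    print("timeout:", exc)
```

An `NXDOMAIN` is a definite answer that a resolver can cache. A timeout is the absence of any answer, which is what resolvers saw during the outage.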
This has an important implication for DNS TTL (Time To Live). Meta's DNS records had relatively short TTLs — in the range of minutes. This means recursive resolvers worldwide quickly discarded their cached entries and attempted revalidation. Every revalidation attempt resulted in a timeout. The result was a storm of DNS queries failing globally, with no fallback possibility, because there were no alternative authoritative servers outside the affected perimeter.
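The TTL effect described above can be sketched with a tiny resolver cache driven by an explicit clock. The 300-second TTL, the record value, and the `ResolverCache` class are assumptions for illustration; real resolvers add retries, negative caching, and in some cases serve-stale behavior.

```python
class ResolverCache:
    """Minimal recursive-resolver cache: answers from cache until the TTL expires."""

    def __init__(self, upstream):
        self.upstream = upstream  # callable(name) -> (address, ttl_seconds)
        self.entries = {}         # name -> (address, expires_at)

    def lookup(self, name, now):
        cached = self.entries.get(name)
        if cached and now < cached[1]:
            return cached[0]                 # still fresh: no upstream query
        address, ttl = self.upstream(name)   # may raise TimeoutError
        self.entries[name] = (address, now + ttl)
        return address


state = {"authoritative_reachable": True}


def upstream(name):
    # Hypothetical authoritative query with an assumed TTL of 300 seconds.
    if not state["authoritative_reachable"]:
        raise TimeoutError("authoritative server unreachable")
    return ("192.0.2.80", 300)


cache = ResolverCache(upstream)
first = cache.lookup("www.example.com.", now=0)      # populates the cache
state["authoritative_reachable"] = False             # routes withdrawn
still_ok = cache.lookup("www.example.com.", now=100)  # served from cache
try:
    cache.lookup("www.example.com.", now=301)        # TTL expired, requery fails
    expired_result = "resolved"
except TimeoutError:
    expired_result = "timeout"
print(first, still_ok, expired_result)
```

With a short TTL, the window in which cached answers mask the outage is only a few minutes, after which every lookup turns into an upstream timeout.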
A resilience practice that would have mitigated this would be maintaining authoritative DNS servers on separate ASNs (Autonomous System Numbers), with independent BGP prefixes, advertised by different transit providers. This is exactly what large managed DNS operators (like Cloudflare, AWS Route 53, NS1) do by design — they distribute their authoritative servers across multiple ASNs so that the failure of one doesn't compromise domain resolvability.
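A check for this property is easy to sketch, assuming you already have a mapping from each authoritative name server to the AS that originates its prefix. The name server names and AS numbers below are hypothetical (documentation AS range), and the functions model only single-AS failure, not transit or registry problems.

```python
def surviving_nameservers(nameservers, failed_asn):
    """Name servers still reachable if every prefix from `failed_asn` disappears."""
    return sorted(ns for ns, asn in nameservers.items() if asn != failed_asn)


def survives_any_single_as_failure(nameservers):
    """True if at least one name server remains after the loss of any one AS."""
    if not nameservers:
        return False
    return all(
        surviving_nameservers(nameservers, asn)
        for asn in set(nameservers.values())
    )


# Hypothetical deployments: name server -> origin AS number.
single_as = {"a.ns.example.com": 64500, "b.ns.example.com": 64500}
multi_as = {"a.ns.example.com": 64500, "b.ns.example.net": 64501}

print(survives_any_single_as_failure(single_as))  # all servers share one AS
print(survives_any_single_as_failure(multi_as))
print(surviving_nameservers(multi_as, failed_asn=64500))
```

Running a check like this against the real delegation of a domain would require an external source for prefix-to-AS data, which is outside the scope of this sketch.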
Remediation: Why It Took 6 Hours to Fix a Command
The natural question when hearing about this incident is: "if the problem was a command that disabled BGP sessions, why wasn't it enough to just re-enable the BGP sessions?" The answer reveals the second layer of systemic failure: the recovery control plane was inside the compromised perimeter.
Meta operates infrastructure of extraordinary scale — dozens of data centers, hundreds of points of presence, thousands of routers. Managing this infrastructure is done through sophisticated internal tools, accessible via corporate networks that, in turn, depend on the same backbone systems that were offline. When the backbone went down, remote engineers lost access to network management tools. It wasn't possible to simply open a terminal and re-enable BGP sessions — the tools to do so were inaccessible.
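One way to surface this kind of hidden coupling before an incident is to model tooling dependencies as a graph and ask which recovery tools transitively depend on a given component. The graph below is entirely hypothetical; the node names are illustrative and do not describe Meta's actual systems.

```python
def transitive_deps(graph, start):
    """All components that `start` depends on, directly or indirectly."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for dep in graph.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def tools_lost_if(graph, failed, tools):
    """Recovery tools that become unusable when `failed` goes down."""
    return sorted(t for t in tools if failed in transitive_deps(graph, t))


# Hypothetical dependency graph: component -> components it needs.
graph = {
    "router-mgmt-tool": ["corp-network"],
    "corp-network": ["backbone"],
    "badge-readers": ["auth-service"],
    "auth-service": ["backbone"],
    "oob-console": ["oob-cellular-link"],  # out-of-band path, independent of backbone
}
recovery_tools = ["router-mgmt-tool", "badge-readers", "oob-console"]

print(tools_lost_if(graph, "backbone", recovery_tools))
```

In this toy graph only the out-of-band console survives a backbone failure, which mirrors why physical console access ended up being the recovery path.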
The solution was to physically dispatch engineers to data centers. This introduced significant delays for several reasons: mobilizing people to physical locations takes time; some data centers are in remote or controlled-access locations; and, ironically, physical access control systems (badge readers, door authentication systems) also partially depended on services that were offline, requiring manual emergency authorization procedures.
Once physical access was established, engineers could connect directly to routers via console and manually re-enable BGP sessions. But even at that point, recovery had to be carefully staged. When Meta's services became reachable again, billions of clients (mobile apps, browsers, third-party systems) attempted to reconnect simultaneously. An abrupt restoration could generate a thundering herd — an avalanche of requests that would overwhelm systems still recovering. The team opted for a gradual restoration, progressively re-adding BGP prefixes and monitoring load at each step.
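The staged approach can be sketched as a loop that re-announces prefixes in small batches and stops as soon as measured load crosses a threshold. This is a simplified illustration, not Meta's procedure: `staged_restore`, the batch size, the load function, and the prefixes (documentation ranges) are all assumptions.

```python
def staged_restore(prefixes, batch_size, measure_load, max_load):
    """Re-announce prefixes batch by batch, checking load after each step.

    measure_load: callable(announced_prefixes) -> load as a percentage.
    Returns (announced_prefixes, completed). If a batch pushes load above
    max_load, that batch is rolled back and the rollout stops.
    """
    announced = []
    for start in range(0, len(prefixes), batch_size):
        batch = prefixes[start:start + batch_size]
        announced.extend(batch)
        if measure_load(announced) > max_load:
            del announced[-len(batch):]  # roll back the batch that caused overload
            return announced, False
    return announced, True


# Hypothetical prefixes and a fake load model: 25% load per announced prefix.
prefixes = ["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24", "2001:db8::/32"]


def fake_load(announced):
    return 25 * len(announced)


print(staged_restore(prefixes, batch_size=2, measure_load=fake_load, max_load=100))
print(staged_restore(prefixes, batch_size=2, measure_load=fake_load, max_load=60))
```

A real rollout would wait for load to settle between batches and retry later rather than stopping outright; the sketch only shows the gating logic.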
This experience led Meta to announce structural changes: clearer separation between the management control plane and production infrastructure, creation of out-of-band emergency access channels that don't depend on production services, and revision of validation processes for high-impact commands on network infrastructure.
Technical Lessons from the Incident
I've worked with financial systems where a poorly executed maintenance window can have severe regulatory and reputational consequences. The Meta incident resonates because the failure pattern is universal: infrastructure tooling that lacks a built-in notion of "maximum impact scope", enforced as a hard constraint rather than as an optional software check.
If I were to redesign Meta's network management system based on what we know now, I would prioritize three structural changes:
First, mandatory control plane separation. The infrastructure management network must be physically separated from the production network — not just logically separated via VLANs or VRFs, but with independent transit uplinks and distinct ASNs. This is standard in serious telecommunications operators and should be standard in any company that operates its own backbone. The cost is real, but it's a fraction of the cost of a single incident of this magnitude.
Second, authoritative DNS on truly independent ASNs. It's not enough to have DNS servers in different data centers if they're all advertised by the same AS. Meta should have (and probably does now) authoritative servers advertised by separate ASNs with distinct transit providers — exactly as Cloudflare, AWS Route 53, and other major DNS operators do. This isn't paranoia; it's basic resilience engineering for an asset as critical as DNS.
Third, and most importantly: blast radius as a first-class primitive in tooling. Every tool that can issue commands to network infrastructure must have a mandatory --max-scope parameter or equivalent, wit
Verdict: A Design Failure, Not an Operational One
The Meta incident of October 2021 is frequently described as "human error" — and technically, a human issued a command that caused the problem. But that description is incomplete and, in my view, dangerous, because it diverts attention from the systemic failures that made it possible for a single command to take down the entire company. The real failure was a design failure at three levels: (1) a network management tool without blast radius constraints; (2) authoritative DNS servers within the same BGP perimeter they depended on, creating a circular dependency; and (3) a recovery control plane coupled to the data plane that failed. Any one of these failures individually would have been manageable. All three together transformed an operational error into a six-hour global catastrophe. The central lesson for architects and senior engineers is this: in critical-scale systems, the question is not "can we prevent all human errors?" — the answer is always no. The question is "when an inevitable human error occurs, what is the maximum possible blast radius, and are we comfortable with it?" If the answer is "it can take everything down," the design needs to change. The incident also