Who is Fernando F. Azevedo?

Fernando F. Azevedo is a Senior Solutions Architect at Banco Itaú with 16+ years of experience across AWS, event-driven architecture, DevSecOps, Data Mesh, AI and financial systems.

Meta 2021: How a Maintenance Command Took Facebook Down for 6 Hours

Oct 4, 2021 · 11 min · AI-assisted

On October 4, 2021, a maintenance command issued during a backbone capacity audit caused the BGP routes for Meta's network to be withdrawn, making its authoritative DNS servers unreachable from the internet. The cascade took down Facebook, Instagram, WhatsApp, and Oculus simultaneously for approximately six hours, affecting billions of users and exposing critical weaknesses in change safeguards, access control systems, and emergency recovery procedures.

Incident Facts

Company / System: Meta (Facebook, Instagram, WhatsApp, Oculus)
Date: October 4, 2021
Duration: ~6 hours (approximately 15:51 UTC to ~21:45 UTC)
User impact: ~3.5 billion users unable to access all Meta services
Financial impact (estimated): ~US$ 60–100 million in lost ad revenue (market estimate)
Affected services: Facebook, Instagram, WhatsApp, Oculus, Workplace, Meta internal systems
Root cause: Accidental withdrawal of BGP routes from backbone during network capacity audit
Stack / Protocols: BGP (Border Gateway Protocol), internal authoritative DNS, FBTUN (proprietary backbone), internal network management tooling
Failure type: Human error amplified by absent safeguards and excessive coupling between control plane and data plane

On October 4, 2021, a Meta engineer executed a routine maintenance command to audit backbone network capacity. The command did far more than intended, and no safeguard stopped it. Within minutes, the BGP routes advertising Meta's IP prefixes to the internet were withdrawn. The company's authoritative DNS servers, which reside within that same infrastructure, became unreachable. Facebook, Instagram, WhatsApp, and Oculus vanished from the internet simultaneously. What followed was one of the largest service outages in modern internet history, and a lesson in how poorly designed control systems can turn a routine task into a global catastrophe.

What Happened: From Audit to Collapse

To understand the incident, one must first understand how Meta structures its global network. The company operates one of the largest private networks in the world, with points of presence (PoPs) in dozens of countries connected by a proprietary backbone. This backbone is advertised to the internet via BGP — the inter-autonomous-system routing protocol that, simply put, tells the rest of the internet "these IP prefixes belong to Meta, send traffic here."
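To make the role of BGP announcements concrete, here is a minimal sketch of a routing table with longest-prefix matching. It is a toy model, not a BGP implementation: the `RoutingTable` class, the peer names, and the prefixes (documentation ranges from RFC 5737, not Meta's real prefixes) are all assumptions for illustration.

```python
import ipaddress
from typing import Optional


class RoutingTable:
    """Toy routing table: maps announced prefixes to a next hop (illustrative only)."""

    def __init__(self) -> None:
        self._routes = {}  # ip_network -> next hop name

    def announce(self, prefix: str, next_hop: str) -> None:
        self._routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix: str) -> None:
        self._routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address: str) -> Optional[str]:
        """Return the next hop of the longest matching prefix, or None if no route exists."""
        ip = ipaddress.ip_address(address)
        matches = [net for net in self._routes if ip in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self._routes[best]


table = RoutingTable()
table.announce("192.0.2.0/24", "peer-A")     # hypothetical prefix holding a name server
table.announce("198.51.100.0/24", "peer-B")  # unrelated network, unaffected

before = table.lookup("192.0.2.53")
table.withdraw("192.0.2.0/24")
after = table.lookup("192.0.2.53")
print(before, after)  # peer-A None
```

Once the prefix is withdrawn, the host at 192.0.2.53 is still running, but the lookup returns no route, so packets addressed to it are dropped before they get anywhere near it.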

On the morning of October 4, an engineering team was performing maintenance work on the backbone infrastructure to audit available capacity. They used an internal network management tool that, among other functions, allows modifying router configurations across the network spine. The issued command was intended only to assess available capacity. Meta's engineering write-up states that its systems are designed to audit commands like this one, but a bug in the audit tool prevented it from stopping the command. The result was that the connections, and with them the BGP sessions, went down on all backbone routers simultaneously.

The result was immediate and total: all of Meta's IP prefixes were withdrawn from global BGP. From the internet's perspective, Meta simply ceased to exist. No DNS resolver in the world could reach Meta's authoritative name servers to resolve facebook.com, instagram.com, or whatsapp.com — because those servers sat behind the same infrastructure that had just been disconnected. DNS itself wasn't broken; it was simply unreachable.

What made the situation even more critical was the collateral effect on Meta's own internal systems. Monitoring tools, remote access systems, and even physical data center badge readers depended on services that also went offline. Engineers attempting to diagnose the problem remotely lost access to the very tools needed to fix it. The team had to dispatch technicians physically to the data centers — a process that took hours, especially since physical access control systems were also partially affected.

Incident Timeline

1. ~15:51 UTC: Maintenance command issued
Engineer executes internal network management tool for backbone capacity audit. The command, due to a bug or misconfiguration, instructs routers to disable BGP sessions across the entire spine.
2. ~15:51–15:53 UTC: Cascading BGP route withdrawal
Within approximately two minutes, all Meta IP prefixes are withdrawn from global BGP. Routers worldwide stop forwarding traffic to Meta infrastructure. DNS entry TTLs begin expiring in recursive resolvers.
3. ~15:53 UTC: DNS fails globally
Recursive DNS resolvers worldwide attempt to query Meta's authoritative name servers (a.ns.facebook.com, b.ns.facebook.com, etc.) and receive timeouts. facebook.com, instagram.com, and whatsapp.com become unresolvable. Users begin reporting failures.
4. ~16:00 UTC: Internal alerts fire; remote access compromised
Meta's monitoring systems detect the outage, but incident response tools themselves depend on services that are also offline. Remote engineers lose access to diagnostic and management tools. Internal communication via internet-based systems is also affected.
5. ~16:30–18:00 UTC: Physical access to data centers
Teams are physically dispatched to data centers. The process is slow: physical access control systems (badge readers) partially depend on affected services. Technicians require manual emergency authorizations to enter some facilities.
6. ~18:00–20:00 UTC: Diagnosis and start of recovery
With physical access established, engineers identify the root cause: disabled BGP sessions on backbone routers. They begin the manual process of re-enabling BGP sessions. Recovery is deliberately gradual to avoid sudden load surge on systems coming back online.
7. ~20:00–21:45 UTC: Gradual service restoration
BGP routes are progressively re-added. Authoritative DNS servers become reachable again. Recursive resolvers worldwide begin resolving Meta's domains again. Services are restored in a staggered fashion to prevent thundering herd.
8. ~21:45 UTC: Services fully restored
All Meta services report normal operation. Total incident duration: approximately 6 hours.
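As a quick check on where the six hours went, the arithmetic below computes the length of each ranged phase from the approximate timestamps in the timeline. The timestamps are the article's own estimates; the helper name `minutes_between` is just an illustration.

```python
from datetime import datetime, timedelta

# Phase boundaries copied from the timeline above (approximate, UTC).
PHASES = [
    ("Physical access to data centers", "16:30", "18:00"),
    ("Diagnosis and start of recovery", "18:00", "20:00"),
    ("Gradual service restoration", "20:00", "21:45"),
]


def minutes_between(start: str, end: str) -> int:
    """Minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta / timedelta(minutes=1))


phase_minutes = {name: minutes_between(start, end) for name, start, end in PHASES}
total_minutes = minutes_between("15:51", "21:45")

for name, minutes in phase_minutes.items():
    print(f"{name}: {minutes} min")
print(f"Total outage: {total_minutes // 60} h {total_minutes % 60} min")
```

By these figures the route withdrawal itself took about two minutes, while gaining physical access, diagnosing, and restoring account for 315 of the 354 minutes.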

Failure Flow: BGP, DNS, and the Cascading Collapse

The diagram reconstructs the failure flow: how the withdrawal of BGP routes from the backbone made authoritative DNS servers unreachable, which in turn made all Meta services unresolvable to the entire world.

Diagram components, grouped by layer:

Internet / External World: End User (browser / app), ISP Recursive DNS Resolver, Global BGP Routing Table
Meta Edge / PoPs: Edge Routers (BGP peers), Backbone Routers (BGP sessions disabled)
Meta Internal, Control Plane: Network Management Tool (maintenance command), Engineer (issued command)
Meta Internal, Data Plane (all unreachable): Authoritative DNS (a/b.ns.facebook.com), Facebook App Servers, Instagram App Servers, WhatsApp App Servers, Internal Tools and Monitoring

Root Cause: Fatal Coupling Between Control Plane and Data Plane

The root cause was not just the incorrect command. It was a systemic design failure across three overlapping layers:

1. No effective dry-run or blast radius validation: The network management tool let a command reach all backbone routers without an effective simulation step, staged confirmation, or scope limit. A command that should have been surgical could be, and was, global.
2. Authoritative DNS inside the affected perimeter: Meta's authoritative name servers (a.ns.facebook.com, b.ns.facebook.com) were advertised via the same BGP prefixes that were withdrawn. This created a fatal circular dependency: to reach any Meta service, the world needed to resolve DNS; to resolve DNS, it needed to reach the authoritative servers; to reach the authoritative servers, it needed the BGP routes that had just disappeared.
3. Recovery control plane dependent on the compromised data plane: The tools engineers would need to diagnose and fix the problem (monitoring systems, remote access, internal communication) depended on the same infrastructure that was offline. The recovery system was coupled to the system that failed.

The Authoritative DNS Trap: An Invisible Circular Dependency

One of the most technically interesting aspects of this incident is the specific role of DNS — and why it could not recover automatically even when parts of the infrastructure might have been restored more quickly.

DNS works in layers. When a recursive resolver (like Google's 8.8.8.8 or your ISP's resolver) needs to resolve facebook.com, it first queries root servers to find who is authoritative for .com, then queries TLD servers to find who is authoritative for facebook.com, and finally queries Meta's name servers directly. Those name servers (a.ns.facebook.com, b.ns.facebook.com, etc.) are the ones that return the actual IP addresses of application servers.
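The layered lookup can be sketched as a toy simulation. This is not a real resolver: the `SERVERS` table, the server labels, and the answer address (a documentation address from RFC 5737) are assumptions made for illustration, and real resolution involves caching, retries, and multiple name servers per zone.

```python
# Toy model of iterative resolution: each "server" either refers the resolver
# to the next server in the chain or returns a final answer.
SERVERS = {
    "root": {"referral": "tld-com"},
    "tld-com": {"referral": "auth-facebook"},
    "auth-facebook": {"answer": {"facebook.com": "203.0.113.10"}},
}


def resolve(name: str, reachable: set) -> tuple:
    """Walk root -> TLD -> authoritative; stop at the first unreachable server.

    Returns (servers_contacted, result), where result is an address,
    "NXDOMAIN", or a timeout message.
    """
    server = "root"
    contacted = []
    while True:
        if server not in reachable:
            return contacted, f"TIMEOUT at {server}"
        contacted.append(server)
        record = SERVERS[server]
        if "answer" in record:
            return contacted, record["answer"].get(name, "NXDOMAIN")
        server = record["referral"]


healthy = resolve("facebook.com", {"root", "tld-com", "auth-facebook"})
outage = resolve("facebook.com", {"root", "tld-com"})
print(healthy)
print(outage)
```

In the outage case the root and TLD steps still succeed, which is why the rest of the DNS hierarchy looked healthy while only the last hop timed out.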

The problem is that Meta's authoritative name servers have IP addresses within Meta's own prefixes, the same ones withdrawn from BGP. When a resolver tried to reach them, packets simply had no route. This wasn't a DNS failure in the sense of "the server responded with an error"; it was a network reachability failure. The server was there, running, but no packets could reach it.

This has an important implication for DNS TTL (Time To Live). Meta's DNS records had relatively short TTLs — in the range of minutes. This means recursive resolvers worldwide quickly discarded their cached entries and attempted revalidation. Every revalidation attempt resulted in a timeout. The result was a storm of DNS queries failing globally, with no fallback possibility, because there were no alternative authoritative servers outside the affected perimeter.
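The effect of TTL length on cache survival can be illustrated with a small cache that uses an explicit clock. The `TtlCache` class, the TTL values (300 seconds versus one day), and the cached address are assumptions for the sketch, not Meta's actual settings.

```python
class TtlCache:
    """Minimal resolver-style cache with an explicit clock (in seconds)."""

    def __init__(self) -> None:
        self._entries = {}  # name -> (value, expires_at)

    def put(self, name: str, value: str, ttl: float, now: float) -> None:
        self._entries[name] = (value, now + ttl)

    def get(self, name: str, now: float):
        """Return the cached value, or None if missing or expired."""
        entry = self._entries.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        return value if now < expires_at else None


def cached_answer_at(ttl: float, seconds_into_outage: float):
    """Cache a record at t=0 (outage start) and read it back later."""
    cache = TtlCache()
    cache.put("facebook.com", "203.0.113.10", ttl=ttl, now=0)
    return cache.get("facebook.com", now=seconds_into_outage)


short_ttl = cached_answer_at(ttl=300, seconds_into_outage=600)
long_ttl = cached_answer_at(ttl=86400, seconds_into_outage=600)
print(short_ttl, long_ttl)  # None 203.0.113.10
```

Ten minutes into the outage, the five-minute entry is gone and the resolver must ask the unreachable authoritative server again. A longer TTL keeps the name resolvable, although a cached answer only helps if the address it points to is still routable.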

A resilience practice that would have mitigated this would be maintaining authoritative DNS servers on separate ASNs (Autonomous System Numbers), with independent BGP prefixes, advertised by different transit providers. This is exactly what large managed DNS operators (like Cloudflare, AWS Route 53, NS1) do by design — they distribute their authoritative servers across multiple ASNs so that the failure of one doesn't compromise domain resolvability.
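A diversity check along these lines can be sketched as follows. The name server hostnames and the ASN values (taken from the documentation range reserved by RFC 5398) are hypothetical; in practice the mapping would come from routing data rather than a hard-coded dictionary.

```python
def group_by_asn(ns_to_asn: dict) -> dict:
    """Group name servers by the ASN that originates their prefix."""
    groups = {}
    for name_server, asn in sorted(ns_to_asn.items()):
        groups.setdefault(asn, []).append(name_server)
    return groups


def has_single_asn_risk(ns_to_asn: dict) -> bool:
    """True when every name server sits behind the same ASN (or none exist)."""
    return len(set(ns_to_asn.values())) < 2


# Hypothetical zones: one with all name servers in one AS, one spread over two.
coupled = {"a.ns.example.com": 64496, "b.ns.example.com": 64496}
diverse = {"a.ns.example.com": 64496, "b.ns.example.net": 64497}

print(has_single_asn_risk(coupled), group_by_asn(coupled))
print(has_single_asn_risk(diverse), group_by_asn(diverse))
```

The check is deliberately coarse: two ASNs that share a single transit provider or a single control system would pass it while still failing together.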

Remediation: Why It Took 6 Hours to Fix a Command

The natural question when hearing about this incident is: "if the problem was a command that disabled BGP sessions, why wasn't it enough to just re-enable the BGP sessions?" The answer reveals the second layer of systemic failure: the recovery control plane was inside the compromised perimeter.

Meta operates infrastructure of extraordinary scale — dozens of data centers, hundreds of points of presence, thousands of routers. Managing this infrastructure is done through sophisticated internal tools, accessible via corporate networks that, in turn, depend on the same backbone systems that were offline. When the backbone went down, remote engineers lost access to network management tools. It wasn't possible to simply open a terminal and re-enable BGP sessions — the tools to do so were inaccessible.

The solution was to physically dispatch engineers to data centers. This introduced significant delays for several reasons: the logistics of mobilizing people to physical locations takes time; some data centers are in remote or controlled-access locations; and, ironically, physical access control systems (badge readers, door authentication systems) also partially depended on services that were offline, requiring manual emergency authorization procedures.

Once physical access was established, engineers could connect directly to routers via console and manually re-enable BGP sessions. But even at that point, recovery had to be carefully staged. When Meta's services became reachable again, billions of clients (mobile apps, browsers, third-party systems) attempted to reconnect simultaneously. An abrupt restoration could generate a thundering herd — an avalanche of requests that would overwhelm systems still recovering. The team opted for a gradual restoration, progressively re-adding BGP prefixes and monitoring load at each step.
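The staged approach can be sketched as a loop that restores prefixes in batches and pauses when load crosses a threshold. This is an illustration under assumptions, not Meta's procedure: `staged_restore`, the `measure_load` hook, the load numbers, and the prefixes (RFC 5737 documentation ranges) are all hypothetical.

```python
from typing import Callable, Sequence


def staged_restore(
    prefixes: Sequence[str],
    batch_size: int,
    measure_load: Callable[[list], float],
    max_load: float,
) -> tuple:
    """Re-announce prefixes batch by batch; pause when load exceeds max_load.

    Returns (restored, pending). `measure_load` is a hypothetical hook that
    reports load once the prefixes in `restored` are reachable again.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    restored = []
    for start in range(0, len(prefixes), batch_size):
        batch = list(prefixes[start:start + batch_size])
        restored.extend(batch)  # stand-in for re-enabling BGP for this batch
        if measure_load(restored) > max_load:
            # Pause: operators would wait for load to settle before resuming.
            return restored, list(prefixes[start + batch_size:])
    return restored, []


all_prefixes = ["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"]
# Assumed load model: each restored prefix adds 40 units of load.
restored, pending = staged_restore(
    all_prefixes, batch_size=1, measure_load=lambda r: 40 * len(r), max_load=70
)
print(restored, pending)
```

With these numbers the loop pauses after the second prefix and leaves the third pending, which mirrors the idea of monitoring load at each step instead of restoring everything at once.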

This experience led Meta to announce structural changes: clearer separation between the management control plane and production infrastructure, creation of out-of-band emergency access channels that don't depend on production services, and revision of validation processes for high-impact commands on network infrastructure.

Technical Lessons from the Incident

Blast radius must be a first-class constraint in infrastructure tooling. Any tool that can affect multiple systems simultaneously must have mandatory scope mechanisms, dry-run simulation, and staged confirmation. The cost of implementing these safeguards is trivial compared to the cost of a global incident.
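A minimal sketch of such a guard is shown below. The function `plan_change`, the 5% default limit, and the router names are assumptions chosen for illustration; they do not describe Meta's tooling.

```python
class BlastRadiusExceeded(Exception):
    """Raised when a change would touch more of the fleet than allowed."""


def plan_change(targets, fleet_size, max_fraction=0.05, dry_run=True):
    """Validate the scope of a change before it is applied.

    Rejects any change that targets more than max_fraction of the fleet and
    defaults to dry-run, so applying for real is an explicit choice.
    """
    if fleet_size <= 0:
        raise ValueError("fleet_size must be positive")
    unique_targets = sorted(set(targets))
    fraction = len(unique_targets) / fleet_size
    if fraction > max_fraction:
        raise BlastRadiusExceeded(
            f"{len(unique_targets)} of {fleet_size} devices targeted "
            f"({fraction:.0%}); limit is {max_fraction:.0%}"
        )
    return {"targets": unique_targets, "fraction": fraction, "applied": not dry_run}


fleet = [f"router-{i}" for i in range(100)]  # hypothetical backbone fleet

small = plan_change(fleet[:2], fleet_size=len(fleet))
print(small["fraction"], small["applied"])  # 0.02 False

try:
    plan_change(fleet, fleet_size=len(fleet), dry_run=False)
except BlastRadiusExceeded as error:
    print("rejected:", error)
```

The design choice that matters is that the scope limit is enforced before execution and cannot be skipped by omission: a fleet-wide change has to be requested as a sequence of bounded steps.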

Authoritative DNS must never be a single point of failure and must reside on independent ASNs. Hosting authoritative name servers within the same BGP prefixes that depend on them creates a circular dependency. The industry standard solution is distributing authoritative servers across multiple ASNs with different transit providers.

The recovery control plane must be independent of the production data plane. If the tools needed to recover a system depend on that same system, recovery becomes far harder, because the failure disables the very means of fixing it. Out-of-band access channels (dedicated management networks, serial/console access, VPNs with independent uplinks) are resilience requirements, not luxuries.
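One way to test this independence is to walk the dependency graph of each recovery tool and flag any path that leads to production. The graph below is hypothetical (the component names are invented for the sketch), and a real inventory would be far larger and harder to keep accurate.

```python
def transitive_dependencies(graph: dict, start: str) -> set:
    """Return everything `start` depends on, directly or indirectly."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for dependency in graph.get(node, ()):
            if dependency not in seen:
                seen.add(dependency)
                stack.append(dependency)
    return seen


def depends_on_production(graph: dict, tool: str,
                          production: str = "production-backbone") -> bool:
    """True if the tool would be lost when the production component fails."""
    return production in transitive_dependencies(graph, tool)


# Hypothetical dependency graph: component -> components it needs to work.
DEPENDENCIES = {
    "remote-access": ["corporate-network"],
    "corporate-network": ["production-backbone"],
    "monitoring": ["production-backbone"],
    "console-server": ["oob-management-network"],
    "oob-management-network": [],
}

for tool in ("remote-access", "monitoring", "console-server"):
    print(tool, depends_on_production(DEPENDENCIES, tool))
```

In this example only the console server survives a backbone failure, because its single dependency is an out-of-band network. The indirect case (remote access through the corporate network) is the one that is easy to miss in a manual review.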

Physical access control systems are critical infrastructure and must have offline fallback. Badge readers and door authentication systems that depend on network-hosted services create an unexpected failure point in total outage scenarios. These systems must keep functioning in a degraded offline mode.

Short DNS TTLs amplify the impact of reachability outages. Short TTLs are good for operational agility (fast change propagation), but in a reachability failure scenario, they guarantee that caches expire quickly and every resolver in the world starts making queries that will fail. There should be an emergency TTL policy and procedures to increase TTLs before high-risk maintenance.

Disaster recovery tests must include "tools offline" scenarios. Recovery runbooks frequently assume that diagnostic and remediation tools are available. Simulating scenarios where those tools are also unavailable — and practicing recovery via physical or out-of-band access — is essential for critical-scale organizations.

Architect's Perspective: What I Would Do Differently

Senior Solutions Architect

I've worked with financial systems where a poorly executed maintenance window can have severe regulatory and reputational consequences. The Meta incident resonates because the failure pattern is universal: infrastructure tooling that doesn't have the concept of "maximum impact scope" built in as a hardware-level constraint, not just software. If I were to redesign Meta's network management system based on what we know now, I would prioritize the following structural changes.

First, mandatory control plane separation. The infrastructure management network must be physically separated from the production network: not just logically separated via VLANs or VRFs, but with independent transit uplinks and distinct ASNs. This is standard in serious telecommunications operators and should be standard in any company that operates its own backbone. The cost is real, but it's a fraction of the cost of a single incident of this magnitude.

Second, authoritative DNS on truly independent ASNs. It's not enough to have DNS servers in different data centers if they're all advertised by the same AS. Meta should have (and probably does now) authoritative servers advertised by separate ASNs with distinct transit providers, exactly as Cloudflare, AWS Route 53, and other major DNS operators do. This isn't paranoia; it's basic resilience engineering for an asset as critical as DNS.

Verdict: A Design Failure, Not an Operational One

The Meta incident of October 2021 is frequently described as "human error" — and technically, a human issued a command that caused the problem. But that description is incomplete and, in my view, dangerous, because it diverts attention from the systemic failures that made it possible for a single command to take down the entire company. The real failure was a design failure at three levels: (1) a network management tool without blast radius constraints; (2) authoritative DNS servers within the same BGP perimeter they depended on, creating a circular dependency; and (3) a recovery control plane coupled to the data plane that failed. Any one of these failures individually would have been manageable. All three together transformed an operational error into a six-hour global catastrophe. The central lesson for architects and senior engineers is this: in critical-scale systems, the question is not "can we prevent all human errors?" — the answer is always no. The question is "when an inevitable human error occurs, what is the maximum possible blast radius, and are we comfortable with it?" If the answer is "it can take everything down," the design needs to change.

References

Meta Engineering — More details about the October 4 outage

Written with AI assistance from the public case and my architect's reading.
