Cloudflare 2019: The Regex That Took Down a Global Network
Listen to study
generated on playGenerated only on first play
On July 2, 2019, a single WAF rule containing a regex with catastrophic backtracking saturated 100% of CPU across all Cloudflare servers globally, taking down services for approximately 27 minutes. The incident exposed critical failures in the rule deployment process, the absence of protection against algorithmic complexity, and the insufficiency of the testing pipeline to detect pathological patterns.
Incident Facts
- Company / System
- Cloudflare — Global edge network (CDN, WAF, DDoS protection)
- Date
- July 2, 2019
- Duration
- ~27 minutes of total outage; rollback completed at 14:52 UTC
- Impact
- ~80% drop in traffic processed by Cloudflare; 100% CPU utilization across all global PoPs
- Failing component
- WAF XSS rule (number 100137) with catastrophic backtracking regex
- WAF Engine
- Lua WAF running on NGINX / OpenResty
- Deployment mechanism
- Simultaneous global deploy without gradual rollout (no canary/staged rollout)
- Network scale
- Presence in over 193 cities / PoPs at the time of the incident
- Root cause
- Regex with catastrophic backtracking consuming unbounded CPU per HTTP request
A poorly constructed regular expression — 18 tokens, a capture group with a nested quantifier — was enough to paralyze the world's largest edge network for 27 minutes. It was not an attack. It was a WAF rule deployment that passed code review, automated tests, and staging, yet still reached global production carrying an algorithmically explosive pattern. This post-mortem examines what happened, why the existing controls failed, and what Cloudflare — and any team operating critical infrastructure — should have done differently.
What Happened
On July 2, 2019, at 13:42 UTC, Cloudflare deployed new rules to its managed WAF. Among the rules was rule 100137, intended to detect XSS attacks. The core regex of the rule was:
(?:(?:"|'|\]|\[|\\|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||&&)*(?:[0-9]{0,4}d?(?:x|xx|xxx|xxxx)?)(?:;|\\|,|\+|\-|\*|\/|%|&|\||\^|<|>|=|!)))+
The problem lies in the group ((?:\s|-|~|!|{}|\|\||&&)*) combined with the outer + quantifier. This pattern creates what computer science calls catastrophic backtracking: for a string that does not match the pattern but is long enough, the regex engine tries exponentially more combinations before giving up. Execution time ceases to be O(n) and becomes O(2ⁿ) relative to input length.
Cloudflare was using Lua's regex engine (based on PCRE — Perl Compatible Regular Expressions), which has no native protection against this behavior. Unlike finite automaton-based engines (such as RE2 or Rust's regex crate), PCRE performs backtracking without a time limit by default.
Each HTTP request processed by the WAF began consuming 100% of a CPU core for as long as the matching attempt lasted. With real traffic arriving at hundreds of thousands of requests per second at each PoP, all cores on all servers in all data centers were saturated almost instantaneously. Cloudflare's own DDoS protection system — which depends on the same NGINX processes — stopped working. Legitimate traffic was dropped alongside malicious traffic.
Timeline
- 1
13:42 UTC — Deploy initiated
WAF rule 100137 is published globally via the standard deploy pipeline. No staged rollout. The rule goes to all PoPs simultaneously.
- 2
13:42 UTC — CPU saturates globally
Immediately after the deploy, CPU on all Cloudflare servers reaches 100%. HTTP traffic begins to be dropped. Alerts fire in monitoring systems.
- 3
13:47 UTC — First internal response
On-call engineers are paged. Initial correlation points to the WAF deploy as the likely cause, but precise diagnosis has not yet been confirmed.
- 4
~14:00 UTC — Diagnosis confirmed
The team identifies rule 100137 as the source of the problem. Regex analysis reveals the catastrophic backtracking pattern. Rollback decision is made.
- 5
14:52 UTC — Rollback complete
The problematic rule is rolled back across all PoPs. CPU returns to normal. Traffic is restored. Total outage duration: approximately 27 minutes.
- 6
Following hours — Forensic analysis
The team performs detailed regex analysis, documents the backtracking mechanism, and begins designing the process and technology changes needed to prevent recurrence.
Failure Flow: WAF Rule Deploy → Global CPU Saturation
The diagram reconstructs the path taken by the defective rule from the deploy pipeline to the impact at each global PoP, and how catastrophic backtracking propagated to saturate all processing capacity.
- Engineer · Rule Author
- WAF Rules Repo · Git + CI
- Test Suite · (unit + staging)
- Global Deploy · (simultaneous)
- NGINX / OpenResty · HTTP Worker
- Lua WAF Engine · (PCRE regex)
- CPU Core · (saturated 100%)
- DDoS Protection · (NGINX-dependent)
- HTTP Request · (legitimate traffic)
- Customer Origin · Server
Root Cause: Catastrophic Backtracking + Atomic Global Deploy
The technical root cause is the regex ((?:\s|-|~|!|{}|\|\||&&)*) with an outer + quantifier, creating matching ambiguity that forces the PCRE engine to explore exponentially more paths as the input string grows. For a 30-character non-matching string, the number of attempts can exceed billions. The systemic root cause is the combination of three absences: (1) no static complexity analysis of regexes in the CI pipeline; (2) no timeout or per-operation CPU limit in the Lua/PCRE engine; (3) no staged rollout mechanism that would have limited the blast radius to the first affected PoP.
Why Existing Controls Failed
This incident is pedagogically valuable precisely because Cloudflare had quality controls. The rule passed human code review. It passed an automated test suite. It passed a staging environment. And yet it reached global production carrying a catastrophic pattern. Why?
Human code review is effective for business logic, but detecting catastrophic backtracking in regex requires specialized knowledge in automata theory that is not universally distributed among engineers. The regex looked reasonable on visual inspection — it covered the intended XSS cases in the example tests.
Automated tests validated that the rule worked — that it detected the XSS payloads it was designed for. But the tests did not include pathological edge cases: long strings that almost match but don't, which are exactly the inputs that maximize backtracking. This is a classic gap: testing the happy path is not sufficient for security systems that process adversarial input.
Staging replicated the production configuration, but did not replicate traffic volume. With few requests per second, even a pathological regex may not be observable — the impact only becomes visible under real load. This is a common trap in performance testing: the staging environment rarely has the pressure needed to reveal algorithmic complexity problems.
Simultaneous global deploy was the architectural decision that turned a bug into a catastrophe. If Cloudflare had done a staged rollout — starting with 1% of PoPs, observing CPU metrics for 5 minutes, and only then progressing — the impact would have been contained. The absence of staged rollout for WAF changes is surprising given the industry's history with gradual deploys. The implicit justification was probably threat response speed: when you discover a new attack vector, you want immediate global protection. But that speed has a risk cost that this incident quantified painfully.
Remediation and Implemented Changes
Cloudflare published a detailed post-mortem and implemented a substantial set of changes. I analyze each with the eye of someone who has operated edge systems at financial scale:
1. Regex engine replacement (PCRE → RE2/Rust)
The most important change. RE2 and Rust's regex crate guarantee O(n) complexity for any regular expression, eliminating the possibility of catastrophic backtracking by construction. This is not a mitigation — it is an elimination of the failure class. The trade-off is that RE2 does not support some PCRE features (negative lookahead, backreferences), which may require rewriting existing rules. For a WAF, this trade-off is clearly favorable: performance predictability outweighs regex expressiveness.
2. Static complexity analysis in CI
Tools like rxxr2 or automata-theory-based regex complexity analysis were added to the pipeline. Any regex that can exhibit super-linear behavior is rejected before merge. This is correct defense-in-depth: not relying solely on human review for properties that can be automatically verified.
3. Mandatory staged rollout for WAF changes
Rule deployments now follow a gradual rollout process: a subset of PoPs first, with automatic observation of CPU and latency metrics, before global progression. This is the correct standard for any change that affects the critical path of traffic processing.
4. Per-operation CPU limit for regex
Even with RE2, a per-WAF-operation timeout was added as an additional protection layer. If an operation exceeds a time threshold, it is aborted and the request is either passed (fail-open) or blocked (fail-closed) according to the configured policy. This is defense-in-depth: multiple independent layers.
5. Test case improvement
The test suite was expanded to include pathological inputs: long strings, strings that almost match, strings with character repetition. This should have existed from the start for a system that processes adversarial input.
The set of changes is exemplary. Cloudflare did not just fix the symptom (the specific rule) but attacked the root cause at multiple layers: the engine, the deploy process, the CI pipeline, and the tests. This is how a well-executed post-mortem should result in action.
Technical Lessons
rxxr2, safe-regex (Node.js), and automata analysis can detect ReDoS patterns before deploy. There is no justification for not including this in any CI that processes regex rules in production.re, Java java.util.regex) do not guarantee linear complexity. For WAFs, parsers, and any system processing untrusted input, prefer RE2, Rust regex, or Hyperscan.I have worked with edge and WAF systems in financial services contexts, where one minute of unavailability has measurable and regulatory cost. What strikes me about this incident is not the bug itself — regex with catastrophic backtracking is a well-documented problem since the 1990s — but the absence of controls that are, frankly, basic for systems at this scale.
What I would have required before this deploy reached production:
First, static regex complexity analysis in CI is non-negotiable for any system that processes internet input. Tools like rxxr2 existed before this incident. The integration cost is hours, not weeks. The absence of this in a WAF protecting a significant fraction of the internet is a process gap that is hard to justify.
Second, the atomic global deploy concerns me more than the regex itself. A configuration bug will happen — the question is whether you have mechanisms to contain the blast radius. Staged rollout with automatic circuit breaker (if CPU > X% on the pilot PoP, stop the rollout and alert) is standard architecture for critical systems. The threat response speed justification can be addressed with a fast-path rollout (5 minutes at 1% of PoPs, not 30 minutes at 10%) without sacrificing process safety.
Third, and this is less obvious: the choice of PCRE for a high-frequency WAF engine is a risk decision that should have been revisited independently of this incident. RE2 and Hyperscan exist precisely for this use case. Migration has a cost, but the cost of not migrating was demonstrated unequivocally here.
What Cloudflare did
Verdict: An Avoidable Systemic Failure
The Cloudflare incident of July 2019 is a case study in how multiple layers of control can fail simultaneously when each was designed to verify a different property than the one that caused the problem. Code review verified logic. Tests verified functional correctness. Staging verified configuration compatibility. No layer verified worst-case algorithmic complexity under adversarial input — which is exactly the property that matters for a WAF. The central lesson is not 'be careful with regex'. It is deeper: security and performance properties of critical systems must be verified by mechanisms appropriate for each property, not assumed as a consequence of generic quality controls. Catastrophic backtracking is statically verifiable. Gradual deployment is an established practice. Operation timeouts are basic resilience engineering primitives. The absence of any of these controls in a system of Cloudflare's scale and criticality is, in retrospect, surprising. What saves the narrative is the response: rollback in 27 minutes, detailed technical post-mortem published publicly, and a set of changes that attacks the problem at multiple layers. Cloudflare turned a public failure into a