Cloudflare (2026): when a single-AZ dependency takes down an 'HA' cluster
On February 20, 2026, Cloudflare experienced a control plane and analytics outage because Kafka and ClickHouse — critical data ingestion and query services — existed only in zone PDX-04, while the cluster declared as highly available depended on them implicitly. The incident exposes a recurring pattern in distributed systems: the illusion of HA created by partial redundancy that does not cover the full dependency chain.
Incident Facts
- Company: Cloudflare
- Date: February 20, 2026
- Total duration: Several hours (extended partial degradation)
- Affected systems: Control plane, log pipeline, analytics (ClickHouse), event ingestion (Kafka)
- Failure zone: PDX-04 (Portland, Oregon; Cloudflare internal datacenter)
- Stack involved: Kafka, ClickHouse, internal control plane services, observability pipelines
- Network traffic impact: Cloudflare's global network traffic was not affected; the data plane remained operational
- Root cause: Implicit dependency on single-AZ services (Kafka and ClickHouse in PDX-04) by a cluster declared as HA
- Public source: Official post-mortem published on Cloudflare's blog
High availability is not a property you declare — it is a property you prove at every link in the dependency chain. Cloudflare's February 2026 outage is a precise, instructive case of how a multi-node cluster, operated with every good intention of redundancy, can have its SLA completely nullified by a single supporting service that was never replicated across zones.
What happened
On February 20, 2026, Cloudflare recorded a failure in internal zone PDX-04, located in Portland, Oregon. PDX-04 is one of the internal datacenters Cloudflare operates to support its control infrastructure — distinct from the global edge network that delivers HTTP traffic, DNS, and security to its customers.
The zone failure itself was not the central problem. The problem was what lived exclusively in that zone: Kafka instances responsible for log event ingestion, and ClickHouse instances responsible for storing and serving analytics queries. These two systems had no replicas in other zones. They were, in practice, single-AZ — despite being part of an architecture that, on paper, was described as highly available.
When PDX-04 became unavailable, the control plane services that depended on Kafka to publish events and on ClickHouse to query analytics data simply stopped working or entered error states. Cloudflare's global data plane — which processes real customer traffic — continued operating normally, as it is architecturally independent. But the ability to observe, control, and analyze that traffic was compromised.
What makes this incident particularly interesting from an architectural standpoint is the nature of the dependency: it was implicit. The HA cluster services apparently had no explicit documentation stating they depended on single-AZ components. The dependency mapping had not captured this relationship. And resilience tests — if they existed — had not simulated the isolated loss of PDX-04.
Timeline
1. T+0: PDX-04 failure
Internal zone PDX-04 becomes unavailable. The exact causes of the zone failure are internal, but the cascade effect begins immediately in services that depend on it.
2. T+minutes: Kafka and ClickHouse become unreachable
Kafka brokers and ClickHouse nodes, all resident in PDX-04, stop responding. There is no automatic failover because there are no replicas in other zones to take over.
3. T+minutes: Control plane services begin to fail
HA cluster services that publish logs via Kafka and query analytics via ClickHouse begin returning errors or getting stuck in timeouts. The control plane, responsible for configurations, dashboards, and observability, degrades.
4. T+minutes to hours: Detection and triage
Teams identify that the global data plane is healthy. Investigation focuses on control and analytics services. The dependency on PDX-04 is identified as the common failure link.
5. T+hours: PDX-04 restoration or alternative routing
The zone is recovered or services are routed to bypass the dependency. Kafka and ClickHouse pipelines resume operation. Control plane and analytics return to normal state.
6. Post-incident: Post-mortem publication
Cloudflare publishes a blameless analysis detailing the root cause, impact, and planned corrective actions, including multi-zone replication for Kafka and ClickHouse and dependency mapping review.
Failure Flow: Single-AZ Dependency Behind an HA Cluster
The diagram shows how control plane services, distributed across multiple zones (appearance of HA), depended on Kafka and ClickHouse existing only in PDX-04. The single-zone failure propagated upward through the chain, taking down observability and analytics despite partial redundancy.
- Cloudflare Edge · Data Plane
- Control Plane · Zone A
- Control Plane · Zone B
- Control Plane · PDX-04
- Kafka · Log Ingestion · ⚠️ single-AZ
- ClickHouse · Analytics Store · ⚠️ single-AZ
- PDX-04 · Zone Infra · ❌ FAILED
- Dashboards · & Analytics UI
- Log Pipeline · Consumers
Root Cause: The Illusion of HA Through Implicit Single-AZ Dependency
The control plane cluster was multi-zone at the compute layer, but depended on Kafka and ClickHouse that existed only in PDX-04. This dependency was implicit — not explicitly documented in the system's dependency mapping. The result: the cluster's HA guarantee was false. Any failure in PDX-04 would collapse the entire log ingestion and analytics capability, regardless of how many compute nodes were healthy in other zones. Declared HA without verified HA in supporting dependencies is just partial availability with a misleading label.
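A rough arithmetic sketch makes the gap between declared and actual availability concrete. It assumes zones fail independently and uses a made-up per-zone availability of 99.9%; neither the figure nor the independence assumption comes from the post-mortem, and the zone names simply follow the diagram above.

```python
from itertools import product

# Zone names follow the diagram above; the 99.9% figure is a made-up
# per-zone availability used only for illustration.
ZONES = ["zone-a", "zone-b", "pdx-04"]
ZONE_AVAILABILITY = 0.999

COMPUTE_ZONES = {"zone-a", "zone-b", "pdx-04"}  # control plane compute
DATA_ZONES = {"pdx-04"}                          # Kafka and ClickHouse


def availability(is_up):
    """Sum the probability of every zone up/down combination where is_up holds."""
    total = 0.0
    for states in product([True, False], repeat=len(ZONES)):
        up = {zone for zone, alive in zip(ZONES, states) if alive}
        probability = 1.0
        for alive in states:
            probability *= ZONE_AVAILABILITY if alive else 1.0 - ZONE_AVAILABILITY
        if is_up(up):
            total += probability
    return total


def compute_only(up):
    # Declared HA: at least one compute zone is alive.
    return bool(up & COMPUTE_ZONES)


def end_to_end(up):
    # Actual HA: compute and the data tier must both be reachable.
    return bool(up & COMPUTE_ZONES) and bool(up & DATA_ZONES)


declared = availability(compute_only)
verified = availability(end_to_end)
print(f"declared (compute only): {declared:.9f}")
print(f"verified (end to end):   {verified:.9f}")
```

Under these assumptions the end-to-end figure collapses to the availability of the single zone hosting the data tier, no matter how many compute zones are added.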
The Structural Problem: Hidden Dependencies and the Limits of Declared HA
There is a critical distinction between declared HA and verified HA. Declared HA is what appears in architecture documentation: "our cluster has N nodes in M zones". Verified HA is what you get when you map all transitive dependencies of each service and confirm that each of them also satisfies the same availability SLA.
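The verification step described here can be sketched as a graph walk. This is a minimal illustration, not Cloudflare's tooling: the service names, the `analytics-api` hop, the zone placements, and the two-zone threshold are all assumptions chosen to mirror the incident.

```python
from collections import deque

# Hypothetical dependency graph: service -> services it calls directly.
DEPENDS_ON = {
    "control-plane": ["kafka", "clickhouse"],
    "dashboards": ["analytics-api"],
    "analytics-api": ["clickhouse"],
    "kafka": [],
    "clickhouse": [],
}

# Hypothetical zone placement for each service.
ZONES = {
    "control-plane": {"zone-a", "zone-b", "pdx-04"},
    "dashboards": {"zone-a", "zone-b"},
    "analytics-api": {"zone-a", "zone-b"},
    "kafka": {"pdx-04"},
    "clickhouse": {"pdx-04"},
}


def transitive_dependencies(service, graph):
    """Return every service reachable from `service` through the graph."""
    seen = set()
    queue = deque(graph.get(service, []))
    while queue:
        dep = queue.popleft()
        if dep in seen:
            continue
        seen.add(dep)
        queue.extend(graph.get(dep, []))
    return seen


def ha_violations(service, graph, zones, min_zones=2):
    """List the service or any transitive dependency placed in fewer than min_zones zones."""
    to_check = {service} | transitive_dependencies(service, graph)
    return sorted(s for s in to_check if len(zones.get(s, set())) < min_zones)


for name in ("control-plane", "dashboards"):
    print(name, "->", ha_violations(name, DEPENDS_ON, ZONES))
```

A service passes declared HA if its own zone count meets the threshold; it passes verified HA only if `ha_violations` returns an empty list.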
In Cloudflare's case, the control plane cluster satisfied declared HA. But Kafka and ClickHouse — which are critical supporting dependencies, not optional ones — were single-AZ. This creates what I call HA with an asterisk: the system survives individual compute node failures, but collapses completely if the underlying data service fails.
This pattern is surprisingly common. It appears when:
- Infrastructure services grow organically: Kafka and ClickHouse were likely initially provisioned as smaller supporting services, without the same resilience rigor applied to primary production services. Over time, the dependency grew, but the criticality classification was not revisited.
- Dependency mapping is not maintained: In rapidly evolving systems, it is common for the real dependency graph to diverge from the documented one. Without automatic dependency discovery tools or structured periodic reviews, these gaps accumulate silently.
- Resilience testing is incomplete: Chaos engineering and game days tend to test failures in services we already know are critical. Supporting dependencies — like observability pipelines — are often excluded because they are considered "not business-critical". But when the control plane fails, the distinction between critical and non-critical collapses.
The coupling between the data plane and the observability plane is also worth noting. In this incident, Cloudflare's global traffic continued flowing and the data plane was healthy, but the ability to observe, diagnose, and potentially intervene in that traffic was compromised. In security or incident response scenarios, this loss of visibility can turn a manageable problem into a critical one.
Remediation and Corrective Actions
Cloudflare's post-mortem identifies corrective actions across several dimensions. I analyze each from the perspective of someone who has implemented similar systems:
1. Multi-zone replication for Kafka and ClickHouse
The most direct action: ensure Kafka brokers and ClickHouse nodes exist in multiple zones, with replication configured between them. For Kafka, this means configuring min.insync.replicas and replication.factor so that losing a zone does not result in data loss or write unavailability. For ClickHouse, it means configuring replicas in distinct zones with synchronous or asynchronous replication depending on the acceptable consistency SLA.
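For the Kafka side, the interaction between replication.factor, min.insync.replicas, and zone placement can be checked with a small model. This is a simplified sketch: it assumes producers use acks=all, that every replica was in sync before the failure, and it ignores leader election and the controller quorum. The replica layouts are hypothetical.

```python
def write_availability_after_zone_loss(replica_zones, min_insync_replicas):
    """Map each zone to True if acks=all writes survive losing that zone.

    replica_zones lists the zone of every replica of one partition, so
    len(replica_zones) plays the role of the topic's replication.factor.
    """
    outcome = {}
    for zone in sorted(set(replica_zones)):
        surviving = sum(1 for z in replica_zones if z != zone)
        # Kafka rejects acks=all writes once the in-sync replica count
        # drops below min.insync.replicas.
        outcome[zone] = surviving >= min_insync_replicas
    return outcome


# replication.factor=3, but every replica sits in one zone.
single_zone = write_availability_after_zone_loss(["pdx-04", "pdx-04", "pdx-04"], 2)

# replication.factor=3 spread across three zones, min.insync.replicas=2.
spread = write_availability_after_zone_loss(["zone-a", "zone-b", "pdx-04"], 2)

# Same spread, but min.insync.replicas=3 leaves no headroom.
too_strict = write_availability_after_zone_loss(["zone-a", "zone-b", "pdx-04"], 3)

print("single zone:", single_zone)
print("spread:     ", spread)
print("too strict: ", too_strict)
```

The first case shows why a replication factor alone proves nothing about zone resilience: placement matters as much as the count. In a real cluster, placement across zones is typically driven by the broker-level `broker.rack` setting.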
The practical challenge here is cost and latency. ClickHouse with synchronous cross-zone replication introduces write latency. The decision between asynchronous and synchronous replication is a real trade-off that needs to be explicitly documented — not assumed.
2. Dependency audit and mapping
Identify all services that depend on single-AZ components and classify them by criticality. This sounds simple, but in organizations at Cloudflare's scale, the dependency graph can have hundreds of services. The practical approach I recommend is a combination of:
- Distributed tracing (like Jaeger or Zipkin) for automatic runtime dependency discovery
- Structured architecture reviews before any service promotion to production
- A dependency registry maintained as code (dependency manifests), audited periodically
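Combining the first and third items, a periodic audit could compare the manifest against edges observed at runtime. The sketch below assumes call edges have already been exported from a tracing backend as (caller, callee) pairs; that export format, the `config-store` service, and the manifest contents are illustrative assumptions.

```python
def undeclared_dependencies(declared, observed_calls):
    """Return {service: [dependencies seen at runtime but missing from the manifest]}.

    declared:       dict mapping a service to the set of dependencies in its manifest
    observed_calls: iterable of (caller, callee) pairs derived from traces
    """
    drift = {}
    for caller, callee in observed_calls:
        if callee not in declared.get(caller, set()):
            drift.setdefault(caller, set()).add(callee)
    return {service: sorted(deps) for service, deps in drift.items()}


# Hypothetical manifest: the control plane only declares its config store.
manifest = {
    "control-plane": {"config-store"},
}

# Hypothetical edges observed in traces; duplicates are expected.
observed = [
    ("control-plane", "config-store"),
    ("control-plane", "kafka"),
    ("control-plane", "clickhouse"),
    ("control-plane", "kafka"),
]

report = undeclared_dependencies(manifest, observed)
print(report)
```

Running a check like this in CI, and failing the build when the report is non-empty, is one way to keep the documented graph from silently diverging from the real one.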
3. Isolation between data plane and observability plane
This is the most sophisticated lesson from the incident. The observability plane — logs, metrics, analytics — should not share infrastructure dependencies with the data plane it monitors. If the log pipeline uses the same Kafka as the control plane, a failure in that Kafka simultaneously blinds the operational system and the diagnostic system.
The architectural solution is to treat the observability plane as a system of equal or higher criticality than the data plane, with its own set of isolated dependencies. In AWS, for example, this means log pipelines in separate accounts, with strictly controlled cross-account IAM and no shared infrastructure dependencies.
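Whether two planes are actually isolated can be tested mechanically by intersecting everything each one reaches. The following is a sketch under assumed names: the graph and the choice of entry points for each plane are hypothetical, loosely following the diagram earlier in this study.

```python
def reachable(roots, graph):
    """Return every node reachable from the given roots, roots included."""
    seen = set()
    stack = list(roots)
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return seen


def shared_infrastructure(plane_a_roots, plane_b_roots, graph):
    """List components that both planes depend on, directly or transitively."""
    return sorted(reachable(plane_a_roots, graph) & reachable(plane_b_roots, graph))


# Hypothetical dependency graph: component -> components it depends on.
graph = {
    "control-plane": ["kafka", "clickhouse"],
    "log-pipeline": ["kafka"],
    "dashboards": ["clickhouse"],
}

operational_plane = ["control-plane"]
observability_plane = ["log-pipeline", "dashboards"]

overlap = shared_infrastructure(operational_plane, observability_plane, graph)
print("shared components:", overlap)
```

An empty result is the goal. Anything in the list is a component whose failure takes out both the system and the tools needed to diagnose it.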
4. Chaos engineering with expanded scope
Explicitly include supporting dependencies — Kafka, ClickHouse, cache systems, ETL pipelines — in chaos engineering scenarios. The question that should guide experiment design is not "what happens if this service fails?" but "what happens if any component this service depends on, directly or transitively, fails?"
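Before running a live experiment, the transitive question can be answered on paper by simulating the loss of each zone. This is a minimal sketch: the graph and placements are hypothetical, and it assumes every dependency is a hard one, with no graceful degradation or fallback.

```python
# Hypothetical dependency graph and zone placement, mirroring the incident.
DEPENDS_ON = {
    "control-plane": ["kafka", "clickhouse"],
    "dashboards": ["clickhouse"],
    "log-consumers": ["kafka"],
    "kafka": [],
    "clickhouse": [],
}

PLACEMENT = {
    "control-plane": {"zone-a", "zone-b", "pdx-04"},
    "dashboards": {"zone-a", "zone-b"},
    "log-consumers": {"zone-a", "zone-b"},
    "kafka": {"pdx-04"},
    "clickhouse": {"pdx-04"},
}


def failed_services(lost_zone, graph, placement):
    """Return the services that are down once lost_zone disappears."""
    # Services with no surviving zone are down directly.
    down = {s for s, zones in placement.items() if not (zones - {lost_zone})}
    # Propagate upward: a service is down if any dependency is down.
    changed = True
    while changed:
        changed = False
        for service, deps in graph.items():
            if service not in down and any(dep in down for dep in deps):
                down.add(service)
                changed = True
    return down


def blast_radius_report(graph, placement):
    """Map every zone to the sorted list of services its loss would take down."""
    all_zones = sorted(set().union(*placement.values()))
    return {zone: sorted(failed_services(zone, graph, placement)) for zone in all_zones}


report = blast_radius_report(DEPENDS_ON, PLACEMENT)
for zone, casualties in report.items():
    print(f"lose {zone}: {casualties}")
```

Zones whose loss takes down services that are nominally deployed elsewhere are the ones worth scheduling for a real game day first.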
Technical Lessons
I have seen this pattern in financial systems, in e-commerce platforms, and now at Cloudflare. The mechanism is always the same: a service earns the 'high availability' label because the compute layer was correctly replicated, but nobody audited the supporting dependencies with the same rigor.

What would I do differently? First, I would treat dependency mapping as a production artifact with a defined owner and mandatory review on every significant release: not as documentation, but as an internal SLA contract.

Second, I would apply the principle that any service that feeds the observability plane must have resilience equal to or greater than the service it observes. This sounds obvious stated this way, but in practice it is frequently ignored because log pipelines are seen as 'supporting infrastructure', not critical components.

Third, and this is what matters most, I would explicitly split the chaos budget: half of all chaos engineering experiments would target supporting dependencies (Kafka, ClickHouse, Redis, ETL pipelines), not just business services. Most organizations do the opposite.

The Cloudflare incident is blameless in the correct sense: it is not the failure of a person, it is the failure of a resilience verification process that did not cover transitive dependencies. The fix is systemic, not individual.
Verdict: Verified HA, Not Declared HA
Cloudflare's February 2026 outage is a textbook case of how the illusion of high availability forms and unravels. The system had real redundancy at the compute layer. But Kafka and ClickHouse — the services that gave meaning to that computation, by storing and serving logs and analytics — existed in a single zone. When that zone failed, the compute redundancy became irrelevant.
The central lesson is not technical in the narrow sense — it is not about configuring replication.factor=3 in Kafka, although that is necessary. The lesson is about process and culture: the availability guarantee of a system must be verified across all its transitive dependencies, not just the components that appear in the main architecture diagram.
This requires three things that most organizations do not do systematically: (1) dependency mapping maintained as a living artifact, not static documentation; (2) resilience tests that include supporting dependencies with the same rigor as primary services; (3) explicit isolation between the data plane and the observability plane, so that an operational failure does not simultaneously blind the system and its diagnostics.
Cloudflare deserves credit for the transparency of the post-mortem. This level of openness about internal failures is rare and valuable for the engineering community. The incident itself is fixable — and the actions described in the post-mortem are the right ones. What matters now is that the resilience verification process is institutionalized, not just that these two specific services are replicated.