Who is Fernando F. Azevedo?

Fernando F. Azevedo is a Senior Solutions Architect at Banco Itaú with 16+ years of experience across AWS, event-driven architecture, DevSecOps, Data Mesh, AI and financial systems.

What technical topics does Fernando work with?

Fernando works with AWS, Kubernetes, Kafka, Data Mesh, Amazon Bedrock, RAG, DevSecOps, observability, financial systems and architecture communication using C4, ADRs and trade-off analysis.

Is Fernando available for professional conversations?

Fernando is currently building at Banco Itaú and is open to thoughtful conversations about architecture, cloud, AI, engineering leadership, community, podcasts and technical collaboration.

Cloudflare (2026): when a single-AZ dependency takes down an 'HA' cluster

Feb 20, 2026 · 9 min read · AI-assisted

On February 20, 2026, Cloudflare experienced a control plane and analytics outage because Kafka and ClickHouse — critical data ingestion and query services — existed only in zone PDX-04, while the cluster declared as highly available depended on them implicitly. The incident exposes a recurring pattern in distributed systems: the illusion of HA created by partial redundancy that does not cover the full dependency chain.

Incident Facts

Company: Cloudflare
Date: February 20, 2026
Total duration: Several hours (extended partial degradation)
Affected systems: Control plane, log pipeline, analytics (ClickHouse), event ingestion (Kafka)
Failure zone: PDX-04 (Portland, Oregon — Cloudflare internal datacenter)
Stack involved: Kafka, ClickHouse, internal control plane services, observability pipelines
Network traffic impact: Cloudflare's global network traffic was not affected — data plane remained operational
Root cause: Implicit dependency on single-AZ services (Kafka and ClickHouse in PDX-04) by a cluster declared as HA
Public source: Official post-mortem published on Cloudflare's blog

High availability is not a property you declare — it is a property you prove at every link in the dependency chain. Cloudflare's February 2026 outage is a precise, instructive case of how a multi-node cluster, operated with every good intention of redundancy, can have its SLA completely nullified by a single supporting service that was never replicated across zones.

What happened

On February 20, 2026, Cloudflare recorded a failure in internal zone PDX-04, located in Portland, Oregon. PDX-04 is one of the internal datacenters Cloudflare operates to support its control infrastructure — distinct from the global edge network that delivers HTTP traffic, DNS, and security to its customers.

The zone failure itself was not the central problem. The problem was what lived exclusively in that zone: Kafka instances responsible for log event ingestion, and ClickHouse instances responsible for storing and serving analytics queries. These two systems had no replicas in other zones. They were, in practice, single-AZ — despite being part of an architecture that, on paper, was described as highly available.

When PDX-04 became unavailable, the control plane services that depended on Kafka to publish events and on ClickHouse to query analytics data simply stopped working or entered error states. Cloudflare's global data plane — which processes real customer traffic — continued operating normally, as it is architecturally independent. But the ability to observe, control, and analyze that traffic was compromised.

What makes this incident particularly interesting from an architectural standpoint is the nature of the dependency: it was implicit. The HA cluster services apparently had no explicit documentation stating they depended on single-AZ components. The dependency mapping had not captured this relationship. And resilience tests — if they existed — had not simulated the isolated loss of PDX-04.

Timeline

1. T+0: PDX-04 failure
Internal zone PDX-04 becomes unavailable. The exact causes of the zone failure are internal, but the cascade effect begins immediately in services that depend on it.

2. T+minutes: Kafka and ClickHouse become unreachable
Kafka brokers and ClickHouse nodes, all resident in PDX-04, stop responding. There is no automatic failover because there are no replicas in other zones to take over.

3. T+minutes: Control plane services begin to fail
HA cluster services that publish logs via Kafka and query analytics via ClickHouse begin returning errors or getting stuck in timeouts. The control plane (responsible for configurations, dashboards, and observability) degrades.

4. T+minutes to hours: Detection and triage
Teams identify that the global data plane is healthy. Investigation focuses on control and analytics services. The dependency on PDX-04 is identified as the common failure link.

5. T+hours: PDX-04 restoration or alternative routing
The zone is recovered or services are routed to bypass the dependency. Kafka and ClickHouse pipelines resume operation. Control plane and analytics return to normal state.

6. Post-incident: Post-mortem publication
Cloudflare publishes a blameless analysis detailing the root cause, impact, and planned corrective actions, including multi-zone replication for Kafka and ClickHouse and dependency mapping review.

Failure Flow: Single-AZ Dependency Behind an HA Cluster

The diagram shows how control plane services, distributed across multiple zones (appearance of HA), depended on Kafka and ClickHouse existing only in PDX-04. The single-zone failure propagated upward through the chain, taking down observability and analytics despite partial redundancy.

🌐 Global Edge (unaffected)

Cloudflare Edge · Data Plane

🏢 HA Control Plane (multi-zone)

Control Plane · Zone A
Control Plane · Zone B
Control Plane · PDX-04

💥 PDX-04: Single-AZ (failure point)

Kafka · Log Ingestion · ⚠️ single-AZ
ClickHouse · Analytics Store · ⚠️ single-AZ
PDX-04 · Zone Infra · ❌ FAILED

👤 Consumers

Dashboards & Analytics UI
Log Pipeline Consumers

Root Cause: The Illusion of HA Through Implicit Single-AZ Dependency

The control plane cluster was multi-zone at the compute layer, but depended on Kafka and ClickHouse that existed only in PDX-04. This dependency was implicit — not explicitly documented in the system's dependency mapping. The result: the cluster's HA guarantee was false. Any failure in PDX-04 would collapse the entire log ingestion and analytics capability, regardless of how many compute nodes were healthy in other zones. Declared HA without verified HA in supporting dependencies is just partial availability with a misleading label.
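The mechanism can be reproduced with a toy model. The sketch below is not Cloudflare's tooling: the `Component` class, the `is_available` helper, and the zone labels other than PDX-04 (`zone-a`, `zone-b`, `edge-1`, `edge-2`) are assumptions made for illustration. It treats a component as up only if at least one of its instances survives and every dependency is also up.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    zones: set                                   # zones where instances run
    depends_on: list = field(default_factory=list)


def is_available(name, components, failed_zones, _seen=None):
    """Up if an instance survives AND every dependency is up."""
    _seen = _seen or set()
    if name in _seen:
        # Assumption: a dependency cycle is not itself treated as a failure.
        return True
    comp = components[name]
    if not (comp.zones - failed_zones):
        return False                             # no instance left anywhere
    return all(
        is_available(dep, components, failed_zones, _seen | {name})
        for dep in comp.depends_on
    )


# Hypothetical topology modelled on the failure-flow diagram.
components = {
    c.name: c
    for c in [
        Component("kafka", {"PDX-04"}),
        Component("clickhouse", {"PDX-04"}),
        Component("control-plane", {"zone-a", "zone-b", "PDX-04"},
                  ["kafka", "clickhouse"]),
        Component("edge-data-plane", {"edge-1", "edge-2"}),
    ]
}

for zone in ("zone-a", "PDX-04"):
    status = {n: is_available(n, components, {zone}) for n in components}
    print(f"lose {zone}: {status}")
```

In this model, losing `zone-a` leaves everything up, while losing PDX-04 takes the control plane down even though two of its three compute zones are healthy. The data plane, which has no edge into PDX-04, stays up in both cases.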

The Structural Problem: Hidden Dependencies and the Limits of Declared HA

There is a critical distinction between declared HA and verified HA. Declared HA is what appears in architecture documentation: "our cluster has N nodes in M zones". Verified HA is what you get when you map all transitive dependencies of each service and confirm that each of them also satisfies the same availability SLA.
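The difference between the two checks can be made concrete. The following is a minimal sketch under stated assumptions: the service names `config-api` and `log-publisher`, the zone inventory, and the two-zone threshold are hypothetical, and the `verify_ha` helper is illustrative rather than a description of any real audit tool. Declared HA looks only at the cluster itself; verified HA walks the whole transitive chain.

```python
def transitive_dependencies(service, graph):
    """Return every direct and indirect dependency of `service` (iterative DFS)."""
    seen, stack = set(), list(graph.get(service, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen


def verify_ha(service, graph, zones, min_zones=2):
    """List the links in the chain that run in fewer than `min_zones` zones."""
    chain = {service} | transitive_dependencies(service, graph)
    return sorted(c for c in chain if len(zones.get(c, ())) < min_zones)


# Hypothetical inventory; names and zone placements are illustrative only.
graph = {
    "control-plane": ["config-api", "log-publisher"],
    "config-api": ["clickhouse"],
    "log-publisher": ["kafka"],
}
zones = {
    "control-plane": {"zone-a", "zone-b", "PDX-04"},
    "config-api": {"zone-a", "zone-b"},
    "log-publisher": {"zone-a", "zone-b"},
    "kafka": {"PDX-04"},
    "clickhouse": {"PDX-04"},
}

violations = verify_ha("control-plane", graph, zones)
print("declared HA:", len(zones["control-plane"]) >= 2)
print("verified HA:", not violations, "| single-zone links:", violations)
```

The cluster passes the declared check with three zones, yet the verified check fails because two links two hops away live in a single zone.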

In Cloudflare's case, the control plane cluster satisfied declared HA. But Kafka and ClickHouse — which are critical supporting dependencies, not optional ones — were single-AZ. This creates what I call HA with an asterisk: the system survives individual compute node failures, but collapses completely if the underlying data service fails.

This pattern is surprisingly common. It appears when:

Infrastructure services grow organically: Kafka and ClickHouse were likely initially provisioned as smaller supporting services, without the same resilience rigor applied to primary production services. Over time, the dependency grew, but the criticality classification was not revisited.

Dependency mapping is not maintained: In rapidly evolving systems, it is common for the real dependency graph to diverge from the documented one. Without automatic dependency discovery tools or structured periodic reviews, these gaps accumulate silently.

Resilience testing is incomplete: Chaos engineering and game days tend to test failures in services we already know are critical. Supporting dependencies — like observability pipelines — are often excluded because they are considered "not business-critical". But when the control plane fails, the distinction between critical and non-critical collapses.

There is also a data plane versus observability plane dimension worth noting. In this incident, Cloudflare's global traffic continued flowing because the data plane was healthy, but the ability to observe, diagnose, and potentially intervene in that traffic was compromised. In security or incident response scenarios, this loss of visibility can turn a manageable problem into a critical one.

Remediation and Corrective Actions

Cloudflare's post-mortem identifies corrective actions across several dimensions. I analyze each from the perspective of someone who has implemented similar systems:

1. Multi-zone replication for Kafka and ClickHouse

The most direct action: ensure Kafka brokers and ClickHouse nodes exist in multiple zones, with replication configured between them. For Kafka, this means configuring min.insync.replicas and replication.factor so that losing a zone does not result in data loss or write unavailability. For ClickHouse, it means configuring replicas in distinct zones with synchronous or asynchronous replication depending on the acceptable consistency SLA.

The practical challenge here is cost and latency. ClickHouse with synchronous cross-zone replication introduces write latency. The decision between asynchronous and synchronous replication is a real trade-off that needs to be explicitly documented — not assumed.
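For the Kafka side, the interaction between `replication.factor`, `min.insync.replicas`, and zone placement can be reasoned about with a small model. This is a simplified sketch, not a Kafka client: it assumes producers use `acks=all`, that every replica was in sync before the failure, and that the replica placements shown are illustrative rather than Cloudflare's actual configuration. The `zone_loss_report` function is a hypothetical helper.

```python
from collections import Counter


def zone_loss_report(replica_zones, min_insync_replicas):
    """For each zone, report what losing it does to one partition.

    replica_zones lists the zone of each replica, so its length is the
    replication factor. Assumes acks=all and a fully in-sync replica set.
    """
    per_zone = Counter(replica_zones)
    report = {}
    for zone, lost in per_zone.items():
        surviving = len(replica_zones) - lost
        report[zone] = {
            "data_survives": surviving >= 1,
            "writes_accepted": surviving >= min_insync_replicas,
        }
    return report


# Illustrative placements only.
single_zone = ["PDX-04", "PDX-04", "PDX-04"]   # replication.factor=3, one zone
spread = ["zone-a", "zone-b", "zone-c"]        # replication.factor=3, three zones

print(zone_loss_report(single_zone, min_insync_replicas=2))
print(zone_loss_report(spread, min_insync_replicas=2))
print(zone_loss_report(spread, min_insync_replicas=3))
```

Two points fall out of the model. A replication factor of 3 gives no zone protection when all three replicas sit in one zone, which is why brokers are typically labelled with `broker.rack` so that replica assignment can spread across zones. And setting `min.insync.replicas` equal to the replication factor keeps the data but blocks writes as soon as any one zone is lost.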

2. Dependency audit and mapping

Identify all services that depend on single-AZ components and classify them by criticality. This sounds simple, but in organizations at Cloudflare's scale, the dependency graph can have hundreds of services. The practical approach I recommend is a combination of:

Distributed tracing (like Jaeger or Zipkin) for automatic runtime dependency discovery
Structured architecture reviews before any service promotion to production
A dependency registry maintained as code (dependency manifests), audited periodically
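The first and third items combine naturally: runtime edges from tracing can be diffed against the manifests kept in version control. The sketch below assumes both sources have already been reduced to a mapping from service name to the set of services it calls. The `audit_manifest` helper, the service names, and the sample data are hypothetical; how edges are exported from a tracing backend is left out.

```python
def audit_manifest(declared, observed):
    """Compare declared dependencies with edges seen at runtime.

    Both arguments map a service name to the set of services it calls.
    Returns (undocumented, stale), each keyed by service name.
    """
    services = set(declared) | set(observed)
    undocumented, stale = {}, {}
    for svc in services:
        in_manifest = declared.get(svc, set())
        at_runtime = observed.get(svc, set())
        if at_runtime - in_manifest:
            undocumented[svc] = at_runtime - in_manifest
        if in_manifest - at_runtime:
            stale[svc] = in_manifest - at_runtime
    return undocumented, stale


# Hypothetical data: `declared` from dependency manifests,
# `observed` from caller/callee pairs seen in traces.
declared = {
    "control-plane": {"config-db"},
    "dashboards": {"clickhouse"},
}
observed = {
    "control-plane": {"config-db", "kafka", "clickhouse"},
    "dashboards": {"clickhouse"},
}

undocumented, stale = audit_manifest(declared, observed)
print("undocumented:", undocumented)
print("stale:", stale)
```

Run periodically or in CI, a non-empty `undocumented` result is exactly the kind of implicit dependency this incident exposed: edges that exist in production but in no document.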

3. Isolation between data plane and observability plane

This is the most sophisticated lesson from the incident. The observability plane — logs, metrics, analytics — should not share infrastructure dependencies with the data plane it monitors. If the log pipeline uses the same Kafka as the control plane, a failure in that Kafka simultaneously blinds the operational system and the diagnostic system.

The architectural solution is to treat the observability plane as a system of equal or higher criticality than the data plane, with its own set of isolated dependencies. In AWS, for example, this means log pipelines in separate accounts, with strictly controlled cross-account IAM and no shared infrastructure dependencies.
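One way to make this isolation testable is to tag every component with the fault domains it lives in (zone, cluster, account) and check that the two planes share none. The following is a rough sketch: the `shared_fault_domains` helper, the component names, and the `kind:name` tag format are all assumptions for illustration.

```python
def shared_fault_domains(plane_a, plane_b):
    """Fault domains whose loss would hit both planes at once.

    Each plane maps a component name to the set of fault-domain tags
    (for example "zone:PDX-04" or "cluster:kafka-main") it lives in.
    """
    domains_a = set().union(*plane_a.values())
    domains_b = set().union(*plane_b.values())
    return sorted(domains_a & domains_b)


# Hypothetical inventories for the monitored system and its observability.
operational = {
    "control-plane-api": {"zone:zone-a", "zone:zone-b", "cluster:kafka-main"},
}
observability = {
    "log-pipeline": {"zone:PDX-04", "cluster:kafka-main"},
    "analytics": {"zone:PDX-04", "cluster:clickhouse-1"},
}

print("shared:", shared_fault_domains(operational, observability))

# After isolation, the log pipeline gets its own cluster (hypothetical name).
observability["log-pipeline"] = {"zone:zone-c", "cluster:kafka-obs"}
print("shared:", shared_fault_domains(operational, observability))
```

An empty result does not prove independence, since it only covers the fault domains someone remembered to tag, but a non-empty one is a concrete finding to act on.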

4. Chaos engineering with expanded scope

Explicitly include supporting dependencies — Kafka, ClickHouse, cache systems, ETL pipelines — in chaos engineering scenarios. The question that should guide experiment design is not "what happens if this service fails?" but "what happens if any component this service depends on, directly or transitively, fails?"
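To turn that question into an experiment backlog, one option is to rank every component by how many services sit above it in the dependency graph. The sketch below assumes a hypothetical graph (the `analytics-api` service and the edges are invented for illustration) and a hypothetical `blast_radius` helper. It inverts the graph and walks upward from each component.

```python
from collections import defaultdict


def blast_radius(graph):
    """Map each component to the services that depend on it, directly or transitively."""
    dependents = defaultdict(set)
    for svc, deps in graph.items():
        for dep in deps:
            dependents[dep].add(svc)

    def upstream(node):
        seen, stack = set(), [node]
        while stack:
            for parent in dependents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    nodes = set(graph) | {d for deps in graph.values() for d in deps}
    return {n: upstream(n) for n in nodes}


# Hypothetical graph: service -> components it depends on.
graph = {
    "dashboards": ["analytics-api"],
    "analytics-api": ["clickhouse"],
    "log-pipeline": ["kafka"],
    "control-plane": ["kafka", "analytics-api"],
}

# Largest blast radius first; ties broken alphabetically.
ranked = sorted(blast_radius(graph).items(), key=lambda kv: (-len(kv[1]), kv[0]))
for component, impacted in ranked:
    print(f"{component:15} impacts {sorted(impacted)}")
```

In this toy graph the supporting data stores rank at the top, ahead of the user-facing services, which is the argument for putting them in scope for chaos experiments rather than treating them as background infrastructure.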

Technical Lessons

HA is a transitive property: a cluster is only as available as its least available dependency. Declaring HA at the compute layer without verifying data and messaging dependencies is a false guarantee.

Implicit dependencies are resilience technical debt: if a dependency is not documented in the system's criticality mapping, it is a time bomb. Dependency mapping must be treated as a production artifact, not optional documentation.

Isolate the observability plane from the data plane: log and analytics pipelines must not share infrastructure with the systems they monitor. An incident is the worst time to lose observability.

Kafka and ClickHouse in production require explicit multi-AZ configuration: it is not enough to provision multiple nodes — you must configure replication factor, min.insync.replicas (Kafka) and cross-zone replicas (ClickHouse) with documented latency and cost trade-offs.

Chaos engineering must cover supporting dependencies: resilience tests that cover only primary services leave infrastructure dependencies — messaging, analytics, cache — as failure blind spots.

The data plane surviving does not mean the incident is minor: when the control plane and observability fail, the ability to respond to security or operational incidents is severely compromised, even if traffic continues flowing.
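The first lesson, that HA is transitive, can be checked with basic availability arithmetic. The figures below are illustrative only and are not measurements from the incident. The model also assumes independent failures, which real zones and services only approximate. Redundant copies combine as 1 - (1 - a)^n, while links that must all be up multiply.

```python
from math import prod


def serial_availability(*availabilities):
    """Availability of a chain in which every link must be up."""
    return prod(availabilities)


def redundant_availability(single, copies):
    """Availability of independent replicas when one surviving copy is enough."""
    return 1 - (1 - single) ** copies


HOURS_PER_YEAR = 24 * 365

# Illustrative figures: each zone is assumed to be 99% available.
compute = redundant_availability(0.99, copies=3)   # compute spread over three zones
kafka_single_zone = 0.99                           # one zone only
clickhouse_single_zone = 0.99                      # one zone only

system = serial_availability(compute, kafka_single_zone, clickhouse_single_zone)

print(f"compute layer: {compute:.6f}")
print(f"whole chain:   {system:.6f}")
print(f"expected downtime: {(1 - system) * HOURS_PER_YEAR:.0f} h/year")
```

Under these assumptions the compute layer alone reaches roughly six nines, but the chain as a whole lands at about 98%, below either single-zone dependency on its own. Adding more compute zones barely moves that number. Only replicating the single-zone links does.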

My Perspective: The Problem of HA with an Asterisk

Senior Solutions Architect

I have seen this pattern in financial systems, in e-commerce platforms, and now at Cloudflare. The mechanism is always the same: a service earns the 'high availability' label because the compute layer was correctly replicated, but nobody audited the supporting dependencies with the same rigor.

What would I do differently?

First, I would treat dependency mapping as a production artifact with a defined owner and mandatory review on every significant release. Not as documentation, but as an internal SLA contract.

Second, I would apply the principle that any service that feeds the observability plane must have resilience equal to or greater than the service it observes. This sounds obvious stated this way, but in practice it is frequently ignored because log pipelines are seen as 'supporting infrastructure', not critical components.

Third, and this is what matters most, I would explicitly split the chaos budget: half of the chaos engineering experiments would target supporting dependencies (Kafka, ClickHouse, Redis, ETL pipelines) instead of going only to business services. Most organizations do the opposite.

The Cloudflare incident is blameless in the correct sense: it is not the failure of a person, it is the failure of a resilience verification process that did not cover transitive dependencies. The fix is systemic, not individual.

Verdict: Verified HA, Not Declared HA

Cloudflare's February 2026 outage is a textbook case of how the illusion of high availability forms and unravels. The system had real redundancy at the compute layer. But Kafka and ClickHouse, the services that gave meaning to that computation by storing and serving logs and analytics, existed in a single zone. When that zone failed, the compute redundancy became irrelevant.

The central lesson is not technical in the narrow sense: it is not about configuring replication.factor=3 in Kafka, although that is necessary. The lesson is about process and culture: the availability guarantee of a system must be verified across all its transitive dependencies, not just the components that appear in the main architecture diagram.

This requires three things that most organizations do not do systematically: (1) dependency mapping maintained as a living artifact, not static documentation; (2) resilience tests that include supporting dependencies with the same rigor as primary services; (3) explicit isolation between the data plane and the observability plane, so that an operational failure does not simultaneously blind the system and its diagnostics.

Cloudflare deserves credit for the transparency of the post-mortem. This level of openness about internal failures is rare and valuable for the engineering community. The incident itself is fixable, and the actions described in the post-mortem are the right ones. What matters now is that the resilience verification process is institutionalized, not just that these two specific services are replicated.

References

Cloudflare — Post-mortem on the Control Plane and Analytics Outage

Written with AI assistance from the public case and my architect's reading.
