Who is Fernando F. Azevedo?

Fernando F. Azevedo is a Senior Solutions Architect at Banco Itaú with 16+ years of experience across AWS, event-driven architecture, DevSecOps, Data Mesh, AI and financial systems.

What technical topics does Fernando work with?

Fernando works with AWS, Kubernetes, Kafka, Data Mesh, Amazon Bedrock, RAG, DevSecOps, observability, financial systems and architecture communication using C4, ADRs and trade-off analysis.

Is Fernando available for professional conversations?

Fernando is currently building at Banco Itaú and is open to thoughtful conversations about architecture, cloud, AI, engineering leadership, community, podcasts and technical collaboration.


Datadog (2023): how a systemd security patch took down 5 regions simultaneously

Mar 8, 2023 11 min AI-assisted


In March 2023, an automatic security update to systemd-networkd restarted the networking subsystem on tens of thousands of Datadog Kubernetes nodes simultaneously, severing Cilium-managed connectivity across multiple regions. The incident exposed the risks of uncontrolled OS auto-updates, the critical dependency on CNI plugins in the data plane, and the lack of regional isolation in update pipelines.

Incident Facts

Company: Datadog
Incident date: March 8, 2023
Total duration: ~5 hours to partial restoration; prolonged degradation in some regions
Affected regions: 5 production regions simultaneously
Affected node scale: Tens of thousands of Kubernetes nodes
Customer impact: Metrics, logs, and traces ingestion degraded or interrupted; dashboards and alerts affected across multiple products
Root component: systemd-networkd (automatic security update)
Relevant stack: Kubernetes, Cilium (CNI), systemd, Ubuntu, bare-metal/cloud nodes
Failure type: Uncontrolled infrastructure change → large-scale network connectivity loss

A routine security patch, automatically applied by the OS package manager, restarted the networking subsystem on tens of thousands of nodes simultaneously — and took down five Datadog production regions at once. There was no complex software bug, no hardware failure, no attack. Just the combination of unrestricted auto-update, a CNI plugin sensitive to network restarts, and the absence of regional isolation in the update pipeline. This post-mortem is a case study in blast radius, in the cost of operational conveniences, and in why 'works in staging' is not enough when a change hits global production in parallel.

What happened

On the morning of March 8, 2023, Datadog began receiving alerts of severe degradation across multiple products — metrics ingestion, logs, APM, and synthetics. The most disturbing characteristic of the incident was not the failure itself, but its simultaneity: five independent production regions began exhibiting problems at the same time, which immediately ruled out localized failure hypotheses and pointed to a common cross-cutting cause.

The investigation revealed that Ubuntu, the OS of the compute nodes, had automatically applied a security update to the systemd package. Specifically, the update affected systemd-networkd, the daemon responsible for managing network interfaces at the OS level. Upon being updated, systemd-networkd was restarted — expected and documented behavior for applying security patches to network components.

The critical problem lay in the interaction between this restart and Cilium, the CNI (Container Network Interface) plugin Datadog uses to manage Kubernetes pod networking. Cilium implements its data path in the kernel via eBPF and also installs routing entries on the host for pod traffic, so it depends on the host network configuration staying stable. According to Datadog's public write-up, when systemd-networkd restarted it removed the routes managed by Cilium, effectively severing communication between running workloads without the pods themselves being restarted or reporting failure.

The result was a silent and diffuse failure: pods running, nodes apparently healthy, but with no ability to communicate. Ingestion pipelines stopped receiving data. Internal control services lost connectivity with each other. The Kubernetes control plane continued functioning, but the data plane was effectively broken.

Timeline

1. T-0: Patch released
Ubuntu publishes a security update for the systemd package, including changes to systemd-networkd. The package enters the automatic update repositories.

2. T+0: Automatic application begins
The auto-update mechanism (unattended-upgrades) begins applying the patch to production nodes. No maintenance window, no progressive staging, no regional isolation: the update is applied in parallel across multiple regions.

3. T+minutes: systemd-networkd restarts at scale
On tens of thousands of nodes, systemd-networkd is restarted as part of the patch application. Network interfaces are reconfigured by the daemon.

4. T+minutes: Cilium loses connectivity state
The systemd-networkd restart invalidates the network state Cilium depends on. Pod-to-pod connectivity is severed at scale. Pods continue running and still appear healthy to the kubelet, but cannot communicate.

5. T+~10-20min: Alerts fire
Metrics, logs, and traces ingestion pipelines begin failing. Latency and error alerts fire across multiple products simultaneously. The on-call team is paged.

6. T+~30min: Initial diagnosis
The team identifies that the problem is network connectivity at the node level, not application logic. Correlation with the systemd update is established. Simultaneity across 5 regions confirms the common cause.

7. T+~1-2h: Mitigation underway
Remediation process begins: restarting the Cilium agent on affected nodes, node replacement via rolling replacement, blocking new automatic updates. The scale of the problem makes recovery slow.

8. T+~5h: Partial restoration
Most regions begin recovering. Some regions show prolonged degradation while node replacement completes. Intensive monitoring is maintained.

Failure Flow: from Patch to Connectivity Blackout

The diagram reconstructs the causal path of the incident: how an OS update package traversed the auto-update pipeline, hit systemd-networkd, and propagated the failure to Cilium and consequently to all pod connectivity across multiple regions.

🌐 Ubuntu Package Repository

Ubuntu Security · Repository

🔄 Auto-Update Pipeline (no blast radius control)

unattended-upgrades · (OS daemon)

🖥️ Kubernetes Node (replicated across tens of thousands)

systemd-networkd · (restarted by the patch)
Cilium Agent · (eBPF / CNI)
Pod A · (running, no network)
Pod B · (running, no network)
kubelet · (reports node=Ready)

📊 Datadog Data Plane (affected)

Ingestion Pipeline · (metrics/logs/traces)
Internal Services · (control/coordination)

🌍 5 Regions (impacted simultaneously)

Region 1 · ❌ connectivity
Region 2 · ❌ connectivity
Regions 3-5 · ❌ connectivity

Root Cause: the convergence of three independent design decisions

The root cause was not a bug in systemd-networkd, nor Cilium's behavior in isolation. It was the convergence of three independent design decisions that individually seemed reasonable:

1. Unrestricted auto-update: unattended-upgrades was configured to apply security patches automatically, without maintenance windows, without progressive staging, and without regional isolation. The intent was legitimate (reducing CVE exposure), but the mechanism had no blast radius control.

2. Cilium's sensitivity to systemd-networkd restarts: Cilium depends on host network state that is invalidated when systemd-networkd reconfigures network interfaces. This dependency was not documented as an operational risk in OS update runbooks.

3. Absence of regional isolation in the update pipeline: there was no mechanism to ensure that an infrastructure change would be applied sequentially per region, with health metric observation between each step. As a result, all regions were hit simultaneously, eliminating any possibility of early detection and rollback before widespread impact.

This is a classic failure mode of complex systems: each component behaved as designed, but their interaction produced an unanticipated catastrophic result.
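The third decision, the missing per-region sequencing, is the easiest to picture in code. What follows is a minimal sketch and not Datadog's tooling: `rollout_by_region`, `apply_patch` and `region_is_healthy` are hypothetical names, and the health gate is reduced to a single boolean callback.

```python
from typing import Callable, Iterable, List


def rollout_by_region(
    regions: Iterable[str],
    apply_patch: Callable[[str], None],
    region_is_healthy: Callable[[str], bool],
) -> List[str]:
    """Patch one region at a time and stop at the first failed health gate.

    Returns the regions that were patched, including the one that failed
    its gate, since the patch had already been applied there.
    """
    patched: List[str] = []
    for region in regions:
        apply_patch(region)
        patched.append(region)
        if not region_is_healthy(region):
            # Halt here: the remaining regions are never touched.
            break
    return patched


# Hypothetical scenario: the patch breaks connectivity wherever it lands.
broken = set()
touched = rollout_by_region(
    ["region-1", "region-2", "region-3", "region-4", "region-5"],
    apply_patch=broken.add,
    region_is_healthy=lambda region: region not in broken,
)
print(touched)  # ['region-1']
```

With the gate in place, the same faulty patch stops after one region instead of reaching all five.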

Remediation and incident response

Immediate remediation involved three parallel fronts. First, the team blocked new auto-updates on all remaining nodes to prevent the problem from spreading to nodes not yet affected. Second, on nodes where systemd-networkd had already stabilized, restarting the Cilium agent was in some cases sufficient to restore connectivity. Third, where that was not enough, nodes had to be drained and replaced via rolling replacement, which is an inherently slow process when dealing with tens of thousands of nodes.

Scale was the primary obstacle to rapid recovery. In a localized incident — say, a single region with hundreds of nodes — the drain-and-replace procedure would complete in minutes. With tens of thousands of nodes distributed across five regions, the same procedure took hours. This illustrates an important point: the blast radius of a change affects not only the severity of the initial impact, but also the recovery time. The larger the blast radius, the slower and more costly the remediation.
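The relationship between blast radius and recovery time can be made concrete with a back-of-the-envelope calculation. The numbers below are illustrative assumptions, not figures from Datadog's report: a fixed replacement capacity of 500 nodes per batch and 5 minutes per batch.

```python
import math


def recovery_minutes(nodes: int, batch_size: int, minutes_per_batch: float) -> float:
    """Time to drain and replace `nodes` nodes, `batch_size` at a time."""
    batches = math.ceil(nodes / batch_size)
    return batches * minutes_per_batch


# Illustrative numbers only; none of them come from the incident report.
small = recovery_minutes(nodes=300, batch_size=500, minutes_per_batch=5.0)
large = recovery_minutes(nodes=30_000, batch_size=500, minutes_per_batch=5.0)
print(small, large)  # 5.0 300.0
```

At fixed replacement capacity, recovery time grows linearly with the number of affected nodes: minutes for a few hundred, hours for tens of thousands.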

In the medium term, Datadog implemented a set of structural controls. Unrestricted auto-update was replaced by a managed OS update pipeline, with progressive rollout per region and health metric observation between each step. The interaction between systemd-networkd and Cilium was documented and added to OS patch validation criteria. Compatibility testing between OS updates and the CNI plugin is now executed in staging before any production rollout. Additionally, faster detection mechanisms for node-level network connectivity failures were implemented, specifically to capture the pattern of 'pod running, node Ready, but no connectivity', which is a misleading healthy signal from the kubelet's perspective.
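One way to picture the 'pod running, node Ready, but no connectivity' detection is to combine the kubelet's view with an actual data-path probe. This is a sketch under assumptions: the function names and the three labels are hypothetical, and a plain TCP connect stands in for whatever probe a real platform would use.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks, and timeouts.
        return False


def classify_node(kubelet_ready: bool, probe_ok: bool) -> str:
    """Combine the kubelet's status with the result of a connectivity probe."""
    if kubelet_ready and probe_ok:
        return "healthy"
    if kubelet_ready and not probe_ok:
        # The pattern from this incident: Ready on paper, no connectivity.
        return "ready-but-isolated"
    return "not-ready"
```

A caller would feed `classify_node` the node's Ready condition and the result of `tcp_reachable` against a pod or service endpoint on that node, and alert on "ready-but-isolated".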

Why regional independence failed

One of the fundamental principles of resilience in distributed systems is failure domain isolation. Distinct geographic regions exist precisely so that a failure in one does not propagate to others. Datadog, as an observability company, certainly had this principle embedded in its product architecture. The problem was that the infrastructure update pipeline did not respect those same boundaries.

This reveals a common asymmetry in mature engineering organizations: they invest significantly in failure isolation at the application plane (circuit breakers, bulkheads, retry with backoff, multi-region deployments), while the underlying infrastructure plane may have global couplings that are not treated with the same rigor. unattended-upgrades did not know it was operating in a multi-region system with blast radius requirements. It was just a system daemon doing its job.

The lesson here is that any mechanism that can modify state across multiple production nodes must be treated as a deployment, with all the guarantees that implies: canary release, progressive rollout, health gates between steps, and rollback capability. This applies to OS patches, agent updates, configuration changes via Ansible/Chef/Puppet, and any other form of infrastructure change at scale. The distinction between 'application change' and 'infrastructure change' is operational, not architectural — from a risk perspective, both require the same controls.
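Inside a single region, the canary and progressive-rollout idea reduces to deciding how many nodes each wave may touch. The sketch below assumes cumulative targets of 1%, 10%, 50% and 100%; these percentages and the name `plan_waves` are illustrative choices, not values from the incident report.

```python
from typing import List, Sequence


def plan_waves(total_nodes: int, percents: Sequence[int] = (1, 10, 50, 100)) -> List[int]:
    """Return how many additional nodes each wave updates.

    `percents` are cumulative targets. Each wave is meant to start only after
    the previous one has passed its health gate (the gate is not modelled here).
    """
    waves: List[int] = []
    done = 0
    for percent in percents:
        # Integer ceiling of total_nodes * percent / 100.
        target = min(total_nodes, (total_nodes * percent + 99) // 100)
        if target > done:
            waves.append(target - done)
            done = target
    return waves


print(plan_waves(2000))  # [20, 180, 800, 1000]
```

The first wave is the canary: if 20 nodes lose connectivity, the remaining 1,980 are never patched.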

There is also an aspect of observability of the change process itself that deserves attention. Datadog is an observability company — its products are used by other companies to detect exactly this type of problem. The irony of the incident is that root cause detection took time because the monitoring system itself was partially degraded by the failure. This reinforces the importance of having out-of-band observability — monitoring mechanisms that do not depend on the infrastructure they are monitoring.
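A very small form of out-of-band observability is a dead man's switch: each region sends a periodic heartbeat to a checker that shares no infrastructure with it, and silence is the alert. The sketch assumes heartbeats are recorded as epoch seconds; `stale_sources` and the 120-second threshold are hypothetical.

```python
from typing import Dict, List


def stale_sources(
    last_heartbeat: Dict[str, float],
    now: float,
    max_silence_seconds: float,
) -> List[str]:
    """Return sources whose last heartbeat is older than the allowed silence.

    Intended to run on infrastructure that shares nothing with the systems
    it watches, so it keeps working when they go dark.
    """
    return sorted(
        name
        for name, seen_at in last_heartbeat.items()
        if now - seen_at > max_silence_seconds
    )


# Hypothetical heartbeats (epoch seconds) from three regions.
heartbeats = {"region-1": 1000.0, "region-2": 940.0, "region-3": 700.0}
print(stale_sources(heartbeats, now=1010.0, max_silence_seconds=120.0))
# ['region-3']
```

Because the check depends only on the absence of a signal, it still fires when the monitored region cannot report its own failure.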

Technical Lessons

OS auto-update in production is a deployment in disguise: Any mechanism that applies changes to production nodes must have the same guarantees as an application deployment — progressive rollout, health gates, and rollback.

Blast radius is a design dimension, not an operational one: The impact radius of a change must be limited by design (regional isolation, canary gates) before the change is executed, not managed reactively after the incident.

Runtime dependencies between system components must be documented as operational risks: The interaction between systemd-networkd and Cilium was not a bug — it was system behavior not documented as a risk. OS update runbooks must include CNI plugin validation.

'Node=Ready' does not imply 'network working': The kubelet reported healthy nodes while pod connectivity was broken. Infrastructure health checks must include pod-level network connectivity verification, not just kubelet status.

Regional independence must extend to the infrastructure pipeline: Having independent regions at the application plane is not enough if the infrastructure plane has global couplings. OS, agent, and configuration update pipelines must respect the same failure domain boundaries.

Out-of-band observability is critical: When the monitoring infrastructure is in the same blast zone as the monitored system, diagnostic capability is compromised exactly when it is most needed. Monitoring mechanisms independent of the main infrastructure are essential.

My senior take: the real problem was not systemd

Senior Solutions Architect

When I read this post-mortem, what strikes me is not the technical failure itself. The interaction between systemd-networkd and Cilium is the kind of emergent behavior that appears in complex systems and that no unit test will capture. What strikes me is the absence of a principle that should be non-negotiable in any production system at scale: no change touches more than one failure domain at a time without intermediate observation.

I have seen variations of this pattern in financial systems: a credential rotation script that ran in parallel across all environments, a deploy pipeline that did not respect the region sequence, a security library update that was applied to all services simultaneously because 'it was just a security patch'. The logic is always the same: the change seems small and safe, so the cost of doing a controlled rollout seems unnecessary. And then you discover that 'small and safe' was an assessment made without considering runtime interactions.

What would I do differently? Three concrete things:

1. Treat OS patches as software releases. This means a staging environment with the same production stack (including CNI plugin, kernel version, network configuration), automated post-update smoke tests that include pod connectivity verification, and per-region rollout with a health gate of at least 30 minutes before advancing to the next region.

2. Implement a 'blast radius budget' per change window. No more than X% of a region's nodes can be updated simultaneously, and never more than one region in parallel without explicit approval. This is policy configuration, not complex engineering, but it needs to be a conscious and documented decision.

3. Separate the observability plane from the data plane. Datadog has the specific problem that its monitoring infrastructure and product infrastructure share the same node base. In financial systems, the separation between the trading system and the trading monitoring system is a regulatory requirement. For an observability company, this separation should be a first-class architectural requirement.

Datadog's post-mortem is exemplary in its honesty and technical depth. But the most important lesson is not in the corrective actions. It is in the fact that all the controls implemented after the incident could have been implemented before, had the question 'what is the blast radius of this change?' been part of the standard checklist for any infrastructure modification.
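The 'blast radius budget' from the second point can be expressed as a small policy check that runs before a change window opens. This is a sketch of the idea only: `BlastRadiusBudget`, `violations` and the 5% figure used below are hypothetical, and the "X%" in the text remains a policy decision.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class BlastRadiusBudget:
    max_percent_per_region: float  # the "X%" from the text; a policy choice
    max_parallel_regions: int = 1


def violations(
    budget: BlastRadiusBudget,
    nodes_in_change: Dict[str, int],
    nodes_total: Dict[str, int],
    multi_region_approved: bool = False,
) -> List[str]:
    """Return human-readable reasons why a change window breaks the budget."""
    problems: List[str] = []
    touched = [region for region, count in nodes_in_change.items() if count > 0]
    if len(touched) > budget.max_parallel_regions and not multi_region_approved:
        problems.append(f"{len(touched)} regions in parallel without approval")
    for region in touched:
        percent = 100.0 * nodes_in_change[region] / nodes_total[region]
        if percent > budget.max_percent_per_region:
            problems.append(f"{region}: {percent:.1f}% of nodes exceeds budget")
    return problems


budget = BlastRadiusBudget(max_percent_per_region=5.0)
totals = {"region-1": 1000, "region-2": 1000}
print(violations(budget, {"region-1": 40}, totals))  # []
print(violations(budget, {"region-1": 100, "region-2": 10}, totals))
```

The second call reports two problems: two regions touched in parallel without approval, and region-1 at 10.0% of its nodes against a 5% budget.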

Verdict: when operational convenience becomes systemic risk

Datadog's March 2023 incident was not caused by negligence or incompetence. It was caused by reasonable design decisions that were not reassessed as the system grew in scale and complexity. Security auto-update is a recommended practice. Cilium is a solid technical choice for CNI in Kubernetes. Multiple production regions are a resilience requirement. The problem was that none of these decisions were made considering the interaction of all three together, and that the infrastructure update pipeline never received the same rigorous treatment as the application deploy pipeline.

The central lesson is about change governance at scale: any mechanism that can modify state across multiple production nodes (whether an application deploy, an OS patch, an agent update, or a configuration change) needs explicit blast radius controls, progressive rollout, and intermediate observation. The distinction between 'application change' and 'infrastructure change' should not exist from a risk perspective.

For teams operating Kubernetes in production, this incident is a reminder that the contract between the operating system and cluster network components (CNI plugins, especially eBPF-based ones like Cilium) is more fragile than it appears. Changes to the OS network subsystem should be treated as critical infrastructure changes, tested in staging with the full production stack, and applied with controlled rollout.

And for technical leaders: the question 'what is the blast radius of this change?' should be as automatic as 'does this change have tests?'. Not because incidents like this are inevitable, but because with the right controls, they are preventable.

References

Datadog — 2023-03-08 Incident Post-Mortem (official)

#postmortem #datadog #systemd #cilium #kubernetes #observability #blast-radius #auto-update


Written with AI assistance from the public case and my architect's reading.
