
OpenAI (2024): How a New Telemetry Service Took Down the Kubernetes Control Plane

Dec 11, 2024 11 min AI-assisted

In December 2024, OpenAI deployed a new telemetry agent across its entire Kubernetes fleet simultaneously. The resulting overload on the API servers cascaded into a global outage of ChatGPT, the API, and Sora for several hours, and the saturated control plane itself blocked the rollback. This study analyzes why control plane vs. data plane separation, gradual rollouts, and observability decoupling are not optional.

A single DaemonSet deployed without a canary. Millions of users without service. The control plane so saturated it could not process its own rollback. OpenAI's December 2024 incident is not a story about buggy software — it is a story about the absence of architectural guardrails that any production-scale system should have by default.

Incident Fact Sheet

Company / System: OpenAI (ChatGPT, Public API, Sora)
Incident date: December 11, 2024
Total duration: ~3 hours of severe degradation; full recovery in ~4h15
Impact: Global unavailability of ChatGPT, OpenAI API, and Sora for all users
Root cause: New telemetry agent deployed simultaneously fleet-wide overwhelmed Kubernetes API servers
Aggravating factor: DNS cache masked early impact; control plane saturation blocked the rollback
Relevant stack: Kubernetes (large-scale fleet), DaemonSet, internal DNS, API servers, etcd
Classification: P0 (full outage, global impact)

What Happened

On December 11, 2024, OpenAI's infrastructure team initiated the rollout of a new telemetry collection service. The intent was legitimate and common: improve internal observability of the Kubernetes fleet that underpins all inference infrastructure for ChatGPT, the public API, and Sora. The agent was packaged as a DaemonSet — Kubernetes' standard mechanism for running a pod on every node in the fleet — and the deployment was triggered broadly, hitting a very large number of nodes simultaneously.

The problem was not in the agent's code itself. It was in the emergent behavior of thousands of instances of that agent initializing at the same time, each establishing persistent connections to the Kubernetes API servers to observe cluster state (the Kubernetes API watch pattern). At OpenAI's production fleet scale, this meant thousands of watch connections opening within a very short time window. The API servers, which are the central control point for the entire cluster, began to saturate.
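To make the burst effect concrete, here is a minimal sketch comparing the worst one-second spike of new watch connections when every agent starts at once versus when each agent waits a random delay first. The fleet size and the startup window are hypothetical values chosen for illustration; they do not come from OpenAI's report.

```python
import random
from collections import Counter


def peak_connections_per_second(start_times):
    """Return the largest number of connections opened within any 1-second bucket."""
    buckets = Counter(int(t) for t in start_times)
    return max(buckets.values())


def simultaneous_starts(node_count):
    """Every agent opens its watch at t=0 (no jitter)."""
    return [0.0] * node_count


def jittered_starts(node_count, window_seconds, seed=0):
    """Each agent waits a uniformly random delay inside the window before connecting."""
    rng = random.Random(seed)
    return [rng.uniform(0, window_seconds) for _ in range(node_count)]


NODES = 5000   # hypothetical fleet size
WINDOW = 600   # hypothetical 10-minute startup window, in seconds

burst = peak_connections_per_second(simultaneous_starts(NODES))
spread = peak_connections_per_second(jittered_starts(NODES, WINDOW))
print(f"no jitter: {burst} connections in the worst second")
print(f"with jitter: {spread} connections in the worst second")
```

The total number of connections is the same in both cases; only the arrival rate changes, which is the quantity an API server has to absorb during initialization.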

What makes this incident particularly instructive is the role of DNS caching. During the first minutes after the rollout began, the impact was not immediately visible in end-user service health indicators. The cluster's internal DNS was still resolving addresses from cached entries, meaning application traffic continued flowing normally for a period. This created a window of false safety — latency and error alerts for product services did not fire immediately, which delayed recognition that something was fundamentally wrong at the infrastructure layer.

When DNS caches began to expire and new resolutions needed to pass through Kubernetes' service discovery mechanisms (which depend on the already-saturated API servers), the impact cascaded rapidly to the data plane. Pods could not be scheduled, services could not be discovered, and user traffic began failing at global scale.

Timeline

  1. T+0: Rollout begins

    The infrastructure team initiates deployment of the new telemetry agent as a DaemonSet across the entire Kubernetes fleet. The rollout is not gated by rings or node percentage; it is applied broadly.

  2. T+~5min: Silent control plane saturation

    Thousands of agent instances initialize simultaneously and open watch connections against the API servers. Load on API servers and etcd begins rising rapidly. Product services are not yet reporting errors because DNS cache is masking the impact.

  3. T+~15-20min: DNS cache expiry and cascade

    Cached DNS entries begin to expire. New resolutions depend on Kubernetes service discovery, which in turn depends on the saturated API servers. User traffic begins failing. P0 alerts fire. ChatGPT, API, and Sora become globally unavailable.

  4. T+~25min: Rollback attempt blocked

    The team identifies the DaemonSet as the cause and attempts to revert the deployment. The rollback itself requires Kubernetes API servers to process control operations, but they are too saturated to reliably accept new commands. The rollback fails or takes far longer than expected.

  5. T+~60-90min: Partial recovery

    As some nodes manage to process the DaemonSet rollback, pressure on the API servers begins to gradually decrease. Some services start recovering intermittently.

  6. T+~4h15: Full recovery

    Rollback completes across the entire fleet. API servers return to normal operation. ChatGPT, API, and Sora are declared fully operational. Public post-mortem is published on the status page.

Failure Flow: Telemetry → Control Plane Saturation → Cascade

The diagram shows how the telemetry DaemonSet overwhelmed the API servers, how DNS cache masked the initial impact, and how control plane saturation blocked the recovery mechanism itself.

👤 Users / Clients
  • Users · ChatGPT / API / Sora
🌐 Data Plane: Product Services
  • ChatGPT · Frontend
  • OpenAI API · Inference
  • Sora · Video Gen
🔍 Observability: New Agent (cause)
  • DaemonSet · Telemetry (new)
  • Backend · Telemetry
⚙️ Kubernetes Control Plane
  • API Server · (saturated)
  • etcd · (read overload)
  • Scheduler · (blocked)
  • Controller Manager · (blocked)
🌐 Cluster Internal DNS
  • CoreDNS · (cache expiring)
🖥️ Cluster Nodes (fleet)
  • N Nodes · (entire fleet)

Root Cause: The Control Plane as a Single Point of Operational Failure

The telemetry agent was deployed as a DaemonSet across the entire fleet simultaneously, without a gradual rollout. Each agent instance established persistent watch connections to the Kubernetes API servers at initialization time. The aggregate load of these connections — multiplied by fleet scale — saturated the API servers and the underlying etcd. The critical and frequently underestimated factor: the rollback mechanism itself (kubectl, deploy operators) depends on the same saturated API servers. When the control plane goes down, you simultaneously lose the data plane AND the ability to remediate. DNS cache acted as a temporal attenuator that delayed product alerts, shrinking the reaction window before full cascade.

The DNS Cache Dynamic: An Attenuator That Became a Trap

One of the most technically interesting aspects of this incident is the role of the cluster's internal DNS cache — and how it transformed a detectable problem into one that revealed itself too late for a fast response.

In a Kubernetes cluster, service name resolution is handled by CoreDNS (or equivalent). When a pod needs to communicate with another service, it resolves the service's DNS name, which points to the corresponding ClusterIP. These resolutions are cached locally in pods and intermediate resolvers with a defined TTL, typically between 5 and 30 seconds for cluster DNS, although the effective caching behavior depends on how ndots, search, and CoreDNS TTLs are configured.

At the moment the telemetry DaemonSet began saturating the API servers, application traffic between product services (ChatGPT, API, Sora) continued working normally because DNS resolutions were already cached. Pods did not need to query CoreDNS for each request — they already had the resolved IP addresses. This created a time window — estimated at 15 to 20 minutes based on the incident's progression — during which the control plane was severely degrading, but product SLIs (error rate, latency) remained within normal bounds.

This window of false normality is a classic trap in distributed systems: the caching layer absorbs the initial impact and masks degradation in the underlying layer. When caches expired and new resolutions needed to pass through CoreDNS — which in turn depends on Kubernetes endpoints updated by the saturated API servers — the impact became immediate and total. There was no gradual degradation visible to end users: it was an abrupt transition from "everything working" to "nothing working".
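The masking effect can be illustrated with a toy model. The sketch below is not CoreDNS behavior; it assumes a single client with one cached entry, a hypothetical backend that starts failing at a fixed second, and one lookup per second. All names and numbers are illustrative.

```python
class TTLCache:
    """Tiny resolver cache: serves a cached answer until its TTL expires."""

    def __init__(self, ttl_seconds, resolve):
        self.ttl = ttl_seconds
        self.resolve = resolve          # callable(now) -> address, may raise
        self.value = None
        self.expires_at = None

    def lookup(self, now):
        if self.expires_at is not None and now < self.expires_at:
            return self.value           # cache hit: the backend is not consulted
        self.value = self.resolve(now)  # cache miss: the backend must be healthy
        self.expires_at = now + self.ttl
        return self.value


BACKEND_FAILS_AT = 10  # hypothetical second at which service discovery stops answering


def backend(now):
    """Stand-in for service discovery backed by the control plane."""
    if now >= BACKEND_FAILS_AT:
        raise TimeoutError("service discovery unavailable")
    return "10.0.0.7"


def first_visible_failure(ttl_seconds, horizon=120):
    """Return the first second at which a client lookup fails, or None."""
    cache = TTLCache(ttl_seconds, backend)
    for now in range(horizon):
        try:
            cache.lookup(now)
        except TimeoutError:
            return now
    return None


for ttl in (5, 30):
    print(f"TTL {ttl}s: first client-visible failure at t={first_visible_failure(ttl)}s")
```

With these toy numbers the backend breaks at t=10 in both runs, but the client with the 30-second TTL only sees the failure at t=30. During those 20 seconds every client-side indicator looks healthy, and then the failure appears all at once.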

This has direct implications for alert design: monitoring only product SLIs is insufficient when there is a caching layer between the user and the infrastructure. Control plane health must be monitored independently, with alerts that fire well before the cache expires — not after.
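As a rough sketch of what an independent control plane check could look like, the function below evaluates API server latency and error rate directly, without looking at any product SLI. The thresholds and the function names are assumptions for illustration, not values from the incident or from any particular monitoring product.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def control_plane_alert(latencies_s, error_count, request_count,
                        p99_limit_s=1.0, error_rate_limit=0.01):
    """Return the list of breached conditions; an empty list means healthy.

    p99_limit_s and error_rate_limit are hypothetical thresholds.
    """
    breaches = []
    if percentile(latencies_s, 99) > p99_limit_s:
        breaches.append("apiserver p99 latency")
    if request_count and error_count / request_count > error_rate_limit:
        breaches.append("apiserver error rate")
    return breaches


# Example: 98 fast requests and 2 very slow ones breach the latency condition.
window = [0.05] * 98 + [2.5, 3.0]
print(control_plane_alert(window, error_count=0, request_count=100))
```

The point of the design is that this check can fire while product error rates are still flat, which is the window in which a rollback is still cheap.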

Remediation: When the Cure Needs the Patient Awake

The remediation phase of this incident exposes one of the most dangerous properties of Kubernetes control plane failures: the system you use to fix the problem is the same system that is broken.

When the team identified the telemetry DaemonSet as the cause and attempted to execute the rollback — whether via kubectl rollout undo, via a GitOps operator like ArgoCD or Flux, or via direct API calls — all of these operations require the Kubernetes API servers to accept and process write requests. An API server saturated by thousands of active watch connections has no available processing capacity to reliably accept new control commands. Rollback requests were queued, hit timeouts, or failed entirely.

This is not a Kubernetes bug — it is an expected architectural consequence of a centralized control plane design. etcd, which persists cluster state, was also under elevated read pressure due to the volume of watch connections, which further degraded the API servers' ability to process anything.

Recovery was only possible gradually: as some nodes managed to process the DaemonSet rollback (reducing the number of active watch connections), pressure on the API servers decreased incrementally, freeing capacity to process more rollback operations. It was a self-reinforcing recovery process: each successful rollback enabled the next, but progress was slow because each operation competed with the remaining watch connections.

The operational lesson here is that any incident response plan for a large-scale Kubernetes environment must include remediation mechanisms that do not depend exclusively on the API servers. This can include: direct node access via SSH or SSM to manually remove DaemonSet pods; emergency scripts that operate outside the normal Kubernetes reconciliation cycle; or, in extreme cases, cordon and drain procedures that can be executed with partial control plane connectivity. None of these mechanisms replace prevention — but the absence of an alternative remediation plan turns a serious incident into a catastrophic one.

Technical Lessons Extracted

Control plane ≠ data plane: Kubernetes API server health must be monitored completely independently from product SLIs. A degraded control plane may not immediately manifest in end-user services due to intermediate caches and buffers.
DaemonSets are load multipliers: A DaemonSet deployed on N nodes generates N simultaneous instances. In fleets of hundreds or thousands of nodes, initialization behavior — especially watch connections and API calls — must be designed with rate limiting, exponential backoff, and phased initialization.
Canary-less rollouts on critical infrastructure are unacceptable: Any change affecting all nodes in a fleet must go through progressive deployment rings — 1%, 5%, 25%, 100% — with control plane health metrics as automatic gates between each ring.
DNS cache as a late-failure sensor: Cluster DNS TTL is a resilience parameter, not just a performance one. Very long TTLs increase the false normality window; very short TTLs increase load on CoreDNS and API servers. The balance point depends on your ability to detect control plane failures before the cache expires.
Rollback also needs a Plan B: In any system where the remediation mechanism depends on the failed component, it is necessary to have alternative emergency procedures that are documented, tested, and executable without dependency on the normal control path.
Observability services must be decoupled from the control plane they observe: A telemetry agent that overwhelms Kubernetes API servers to report metrics about Kubernetes is a fundamental antipattern. Observability cannot be the failure vector of the system it monitors.
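The ring-based rollout from the lessons above can be sketched as a loop that promotes a change through cumulative percentages and stops as soon as a health gate fails. This is a minimal illustration, assuming hypothetical `deploy` and `control_plane_healthy` callables supplied by the caller; a real pipeline would also wait for a stabilization period before evaluating each gate.

```python
import math


def ring_sizes(node_count, percents=(1, 5, 25, 100)):
    """Cumulative node counts per ring, with at least one node in the first ring."""
    return [max(1, math.ceil(node_count * p / 100)) for p in percents]


def progressive_rollout(nodes, deploy, control_plane_healthy,
                        percents=(1, 5, 25, 100)):
    """Deploy ring by ring and stop at the first ring whose health gate fails.

    Returns the list of nodes that received the change.
    """
    done = 0
    for target in ring_sizes(len(nodes), percents):
        for node in nodes[done:target]:
            deploy(node)
        done = max(done, target)
        if not control_plane_healthy():
            break  # halt promotion; the blast radius stays at `done` nodes
    return nodes[:done]


# Example: the gate starts failing once 50 nodes run the new agent.
fleet = [f"node-{i}" for i in range(1000)]
deployed = []
touched = progressive_rollout(fleet, deployed.append, lambda: len(deployed) < 50)
print(f"rollout stopped after {len(touched)} of {len(fleet)} nodes")
```

In this example the rollout halts at the 5% ring, so 950 nodes never receive the change and the control plane only has to absorb the load of 50 new agents.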
My Senior Take: What I Would Do Differently
Senior Solutions Architect

This incident bothers me for a specific reason: it is not an exotic scenario. It is exactly the type of failure that happens when a competent engineering team grows too fast for its rollout practices to keep pace with infrastructure scale. What I would implement, in priority order:

  1. Per-client connection limits on API servers: Kubernetes allows configuring --max-requests-inflight and --max-mutating-requests-inflight on API servers, but that is not sufficient alone. What is missing is an admission mechanism that limits the number of watch connections per ServiceAccount or per workload label. A telemetry DaemonSet should not be able to open more than N simultaneous watch connections, regardless of how many nodes exist in the fleet.

  2. Control plane alerts as primary SLOs: I would treat API server latency (p99 of LIST and WATCH requests) and etcd error rate as first-class SLOs, not as infrastructure metrics someone checks after something already broke. These alerts need to fire before DNS cache expires.

  3. Mandatory rollout rings for DaemonSets: Any new or modified DaemonSet should go through a promotion pipeline with automatic gates: 1 node → 1% → 10% → 100%, with at least 10 minutes of stabilization between each phase and API server health metrics as promotion conditions.

  4. Emergency runbook without API server: I want my team to be able to remove a problematic DaemonSet from all nodes even if the API servers are unreachable. This means having tested scripts that operate via direct node access (SSM, emergency access), not just via kubectl.

  5. Observability isolation: Telemetry agents that watch Kubernetes should use ServiceAccounts with minimal permissions and, ideally, operate against a dedicated API server for observability workloads, separate from the API servers serving production traffic. This is additional cost, but it is the correct price to prevent your observability from being the failure vector of your production.

What concerns me most in this case is not what happened — it is that it could have been much worse. If the fleet were larger or if the gradual recovery mechanism had not worked, the downtime could have been measured in days, not hours.
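The first item in the list above, a cap on watch connections per identity, can be illustrated with a small in-memory limiter. This is only a sketch of the idea, not how the Kubernetes API server implements anything; Kubernetes does ship API Priority and Fairness for classifying and limiting request flows, and the class below is not that feature. The identity string and the cap of 100 are hypothetical.

```python
from collections import defaultdict


class WatchLimiter:
    """Caps concurrent watch connections per client identity."""

    def __init__(self, max_watches_per_identity):
        self.limit = max_watches_per_identity
        self.active = defaultdict(int)

    def try_open(self, identity):
        """Admit the watch if the identity is under its cap; otherwise reject."""
        if self.active[identity] >= self.limit:
            return False  # the caller should back off and retry later
        self.active[identity] += 1
        return True

    def close(self, identity):
        """Release one slot held by the identity, if it holds any."""
        if self.active[identity] > 0:
            self.active[identity] -= 1


# Example: 5000 agents sharing one ServiceAccount try to connect at once.
limiter = WatchLimiter(max_watches_per_identity=100)
agent = "system:serviceaccount:monitoring:telemetry-agent"  # hypothetical identity
admitted = sum(limiter.try_open(agent) for _ in range(5000))
print(f"admitted {admitted} of 5000 watch requests")
```

Because the cap is keyed by identity, the rejected agents cannot starve other clients: a rollback issued under a different identity would still be admitted.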

Verdict: Scale Without Operational Maturity Is Systemic Risk

OpenAI's December 2024 incident is a textbook case of how scale amplifies failures that would be harmless in smaller environments. A DaemonSet with aggressive initialization behavior in a 10-node cluster is a performance problem. The same DaemonSet in a fleet of thousands of nodes is a global P0. The technical cause is clear: mass watch connections saturated the Kubernetes API servers. But the systemic cause is deeper: the absence of three architectural properties that should be non-negotiable in any system at OpenAI's scale.

First, gradual rollout as an invariant, not a best practice. Any change affecting all nodes in a fleet should be physically impossible to apply simultaneously without explicit, exceptional approval. This is not process; it is a technical constraint in the deploy pipeline.

Second, independent control plane monitoring. The fact that the impact was masked for 15-20 minutes by DNS cache indicates that control plane health alerts were not sufficiently sensitive or were not monitored with the same priority as product SLIs. In a system where the control plane is the failure point of everything, it must be the first item on the on-call dashboard.

Third, decoupling between observability and the observed system. A telemetry agent that depends on Kubernetes API servers to function, and that can simultaneously saturate those API servers, creates a failure loop that is structurally unacceptable. Observability must be designed to be the last system to fail, not the first.

OpenAI published a transparent and detailed post-mortem, which is, in itself, a maturity practice that many organizations lack. The real value of this incident for the industry is that it documents, publicly and specifically, how the interaction between scale, caching, and control plane dependency can create catastrophic failures from apparently routine changes. Any team operating Kubernetes at scale should use this incident as an architectural review checklist.

Written with AI assistance from the public case and my architect's reading.