# AWS us-east-1 (2025): Datacenter Thermal Event Takes Down EC2 and EBS

A cooling system failure in a us-east-1 datacenter caused servers to overheat, triggering thermal protection shutdowns that degraded EC2 and EBS in the region. The incident reignites debates about workload concentration in us-east-1, the role of physical AZ isolation, and the need for graceful degradation in production architectures.

- URL: https://fernando.moretes.com/studies/aws-us-east-1-thermal-2026

- Markdown: https://fernando.moretes.com/studies/aws-us-east-1-thermal-2026/study.md?lang=en

- Type: Post-mortem

- Company: AWS

- Domain: Resilience

- Date: 2026-06-20

- Tags: aws, us-east-1, resiliencia, ec2, ebs, postmortem, thermal-event, multi-region

- Reading time: 10 min

---

## Incident Fact Sheet

- **Provider:** Amazon Web Services (AWS)
- **Affected region:** us-east-1 (Northern Virginia)
- **Incident date:** January 2025
- **Impacted services:** Amazon EC2, Amazon EBS, dependent services (RDS, ECS, Lambda in some cases)
- **Root cause:** Cooling system failure in a physical datacenter within a us-east-1 AZ
- **Failure mechanism:** Physical host overheating → thermal protection shutdown → loss of compute and storage capacity
- **Blast radius:** Subset of EC2 instances and EBS volumes in affected AZ(s); cascading impact to dependent managed services
- **Amplified risk profile:** us-east-1 hosts the highest density of AWS workloads globally

Cooling is not glamorous architecture — but it is availability. When the cooling system of a datacenter in us-east-1 failed in January 2025, physical servers hit critical thermal thresholds and automatically shut down to protect themselves. The result was EC2 and EBS degradation at regional scale, affecting one of the highest concentrations of critical workloads on the planet. This post-mortem analyzes what happened, why physical infrastructure remains the lower bound of every cloud abstraction, and what resilient architectures must incorporate to survive events that no software SLA can mitigate.

## What happened: from physical failure to production impact

In January 2025, a thermal event in a physical datacenter within AWS's us-east-1 region triggered one of the most discussed outages of the year in the industry. The immediate cause was a cooling system failure — HVAC or equivalent cooling infrastructure — that resulted in abnormal temperature rise inside the affected datacenter. Physical hosts running EC2 instances and serving EBS volumes reached temperature thresholds that triggered automatic thermal protection mechanisms: the servers simply shut down to prevent permanent hardware damage.

This type of failure has a particularly brutal characteristic from an operational standpoint: it is not gradual. A server that shuts down due to thermal protection does not degrade gracefully — it stops. EC2 instances running on those hosts were abruptly terminated. EBS volumes whose data resided on affected storage nodes became inaccessible. Propagation was immediate for any service depending on those instances or volumes: RDS databases with primary instances on affected hosts initiated failover (when configured with Multi-AZ), ECS clusters lost tasks, data pipelines stopped.

AWS confirmed the incident via the AWS Health Dashboard, describing EC2 and EBS degradation in us-east-1. The exact scope — how many AZs were affected, what the density of impacted hosts was — was not disclosed with full granularity, which is standard for AWS incident communications. What became clear from the Network World report and public communication was that the event was physical, not logical: there was no software patch, configuration rollback, or alternative routing that could have prevented the impact on the hosts that overheated.

## Incident Timeline

1. **T-0: Cooling failure** — The cooling system of a physical datacenter in us-east-1 fails. Internal temperature begins rising above safe operational thresholds.

2. **T+minutes: Thermal protection activated** — Physical hosts reach critical temperature thresholds. Automatic thermal protection mechanisms shut down servers to prevent permanent hardware damage. EC2 instances are abruptly terminated; EBS volumes become inaccessible.

3. **T+minutes to hours: Cascading propagation** — Dependent services begin to fail: RDS initiates Multi-AZ failover where configured; ECS and other orchestrators attempt to reallocate tasks; single-AZ applications become completely unavailable. Alarms fire en masse in customer monitoring systems.

4. **T+hours: AWS confirms incident** — AWS publishes an update on the AWS Health Dashboard confirming EC2 and EBS degradation in us-east-1 and indicating active investigation. AWS engineering teams work to restore cooling and assess affected hosts.

5. **T+hours to days: Gradual recovery** — As cooling is restored and hosts are verified, AWS begins bringing capacity back online. Instances and volumes that survived the shutdown are restored; hosts with physical damage require hardware replacement.

6. **Post-incident: Communication and review** — AWS publishes final updates on the Health Dashboard. Intense public discussion about risk concentration in us-east-1 and multi-AZ and multi-region resilience practices.

## Failure Flow: Thermal Event in us-east-1

The diagram reconstructs how the physical cooling failure in a datacenter propagated through the AWS abstraction stack until reaching customer workloads. The failure starts at the physical layer (HVAC), traverses the host hypervisor, and impacts EC2 and EBS, which in turn degrade dependent managed services.

### 🏭 Physical Datacenter: Affected AZ

- HVAC / Cooling [FAILED] (network)
- Physical Host A [Overheated] (compute)
- Physical Host B [Overheated] (compute)
- EBS Storage Node [Inaccessible] (storage)

### ☁️ AWS Virtualization Layer

- EC2 Instances [Terminated] (compute)
- EBS Volumes [Degraded] (storage)

### 🗄️ Managed Services

- Amazon RDS [Failover initiated] (data)
- Amazon ECS [Tasks lost] (compute)
- AWS Lambda [Partial degradation] (compute)

### 👤 Customer Workloads

- Single-AZ App [Unavailable] (frontend)
- Multi-AZ App [Degraded] (frontend)
- Multi-Region App [Survived] (frontend)

### Flows

- hvac -> host1: Critical temperature
- hvac -> host2: Critical temperature
- hvac -> storage_node: Critical temperature
- host1 -> ec2: Thermal shutdown
- host2 -> ec2: Thermal shutdown
- storage_node -> ebs: Inaccessible
- ec2 -> rds: Primary instance lost
- ec2 -> ecs: Tasks terminated
- ebs -> rds: Inaccessible storage
- rds -> single_az_app: DB unavailable
- ecs -> single_az_app: No compute
- rds -> multi_az_app: Multi-AZ failover
- ec2 -> multi_az_app: Healthy AZ takes over
- ec2 -> multi_region_app: Traffic diverted
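
The flow list can be read as a directed graph. The following is a minimal Python sketch, assuming the edges above are the complete picture, that walks the graph breadth-first from `hvac` and reports how many hops it takes the failure to reach each component. The function name `reached_by` is illustrative, and reachability here says nothing about severity: an app that fails over and an app that goes dark are both "reached".

```python
from collections import deque

# Edges copied from the "Flows" list: (source, target, label).
FLOWS = [
    ("hvac", "host1", "Critical temperature"),
    ("hvac", "host2", "Critical temperature"),
    ("hvac", "storage_node", "Critical temperature"),
    ("host1", "ec2", "Thermal shutdown"),
    ("host2", "ec2", "Thermal shutdown"),
    ("storage_node", "ebs", "Inaccessible"),
    ("ec2", "rds", "Primary instance lost"),
    ("ec2", "ecs", "Tasks terminated"),
    ("ebs", "rds", "Inaccessible storage"),
    ("rds", "single_az_app", "DB unavailable"),
    ("ecs", "single_az_app", "No compute"),
    ("rds", "multi_az_app", "Multi-AZ failover"),
    ("ec2", "multi_az_app", "Healthy AZ takes over"),
    ("ec2", "multi_region_app", "Traffic diverted"),
]


def reached_by(origin, flows):
    """Return {node: hops from origin} for every node the failure reaches."""
    children = {}
    for source, target, _label in flows:
        children.setdefault(source, []).append(target)

    hops = {origin: 0}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in hops:  # first visit is the shortest path in BFS
                hops[child] = hops[node] + 1
                queue.append(child)
    return hops


if __name__ == "__main__":
    ordered = sorted(reached_by("hvac", FLOWS).items(), key=lambda item: item[1])
    for node, distance in ordered:
        print(f"{distance} hop(s): {node}")
```

Starting the walk from another node, for example `reached_by("ebs", FLOWS)`, shows which components depend on storage alone.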

> **Root Cause: Physical Infrastructure Is the Lower Bound of Every Abstraction:** The root cause of this incident was not a software bug, a network misconfiguration, or a deployment error. It was the failure of a physical cooling system — HVAC — that resulted in real hardware overheating. This exposes a fundamental truth that cloud abstractions tend to obscure: every EC2 instance runs on a physical server, in a physical rack, in a physical datacenter, with physical power and cooling systems. When those systems fail severely enough, no software layer can compensate. The automatic thermal protection of hosts is a correct safety mechanism — it prevents permanent hardware loss — but its side effect is indistinguishable from catastrophic failure for running workloads. AZ isolation exists precisely to contain this type of physical blast radius, but it only protects architectures that effectively distribute load across AZs.

## Blast Radius: why us-east-1 hurts more

us-east-1 is not just another AWS region. It is the oldest AWS region, the one with the broadest set of available services and, consequently, the one that concentrates the largest proportion of critical workloads globally. Many organizations chose us-east-1 by default: because it was the only option when they started, because their teams know it best, because certain services were only available there for years. The result is a risk concentration that amplifies the impact of any event in that region.

When a thermal event occurs in us-east-1, the blast radius is not just technical — it is business-level. Companies that never seriously considered multi-region because 'AWS never goes down' discover that AWS does go down, and that it goes down exactly in the region where they put everything. The irony is that us-east-1 is also historically the region with the most documented incidents — not necessarily because it is less reliable by design, but because it has more workloads, more traffic, more operational pressure, and more visibility when something fails.

From a physical blast radius perspective, the AWS AZ model is theoretically correct: each AZ is one or more physically separate datacenters, with independent power and networking. A thermal event in a specific datacenter should, in theory, be contained within that AZ. The problem is twofold. First, not all customer architectures genuinely distribute load across AZs in a resilient way; many have implicit dependencies on a single AZ (EBS volumes are not replicated across AZs by default, for example). Second, even with Multi-AZ configured, failover takes time and can introduce unavailability windows that violate SLOs for latency-sensitive applications.

## Remediation and what resilient architectures must incorporate

The immediate remediation of the incident was AWS's responsibility: restore the cooling system, verify the integrity of affected hosts, replace damaged hardware, and bring capacity back online in a controlled manner. This process takes hours to days depending on the extent of physical damage — and there is no shortcut. Hardware that overheated needs to be inspected before being returned to production.

But the architectural remediation — the work that belongs to customer engineering teams — is more interesting and more durable. It starts with an honest question: what happens to my system if an entire AZ is unavailable for 4 hours? If the answer is 'my system is completely unavailable,' the architecture has a concentration risk that needs to be addressed.

The practices that effectively limit the blast radius of events like this are well known, but frequently underestimated in implementation:

**Genuine distribution across AZs**: Having instances in multiple AZs is not enough if the primary database, message queue, or cache sits in a single AZ. Resilience needs to be end-to-end. Auto Scaling groups should be configured with subnets in at least two AZs (ideally three) so that Auto Scaling can balance instances across them. RDS Multi-AZ should be the default for any production database.
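
One way to make this check concrete is a small audit over resource descriptions. The sketch below is illustrative: `single_az_findings` is a hypothetical helper, and it assumes its inputs are lists of dicts shaped like the `AutoScalingGroups` and `DBInstances` entries returned by boto3's `describe_auto_scaling_groups` and `describe_db_instances`. It makes no AWS calls itself. The `MultiAZ` flag may not be meaningful for Aurora, where availability is handled at the cluster level.

```python
def single_az_findings(auto_scaling_groups, db_instances):
    """Flag Auto Scaling groups and RDS instances tied to a single AZ.

    Inputs are plain dicts; only the keys read below are required.
    """
    findings = []

    for group in auto_scaling_groups:
        zones = sorted(set(group.get("AvailabilityZones", [])))
        if len(zones) < 2:
            findings.append(
                f"ASG {group['AutoScalingGroupName']} spans {len(zones)} AZ(s): {zones}"
            )

    for db in db_instances:
        # A missing MultiAZ key is treated as "not Multi-AZ" to stay conservative.
        if not db.get("MultiAZ", False):
            findings.append(f"RDS {db['DBInstanceIdentifier']} is not Multi-AZ")

    return findings


if __name__ == "__main__":
    # Hypothetical inventory, for illustration only.
    groups = [
        {"AutoScalingGroupName": "web", "AvailabilityZones": ["us-east-1a", "us-east-1b"]},
        {"AutoScalingGroupName": "batch", "AvailabilityZones": ["us-east-1a"]},
    ]
    databases = [
        {"DBInstanceIdentifier": "orders", "MultiAZ": False},
        {"DBInstanceIdentifier": "users", "MultiAZ": True},
    ]
    for finding in single_az_findings(groups, databases):
        print(finding)
```

A clean report from a check like this is necessary but not sufficient: it says nothing about caches, queues, or self-managed databases running on EC2.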

**EBS and the AZ affinity problem**: EBS volumes are zonal resources — they exist in a specific AZ and cannot be accessed from another. This means any EC2 instance that depends on an EBS volume has an implicit dependency on the AZ where that volume exists. For data that needs to survive AZ failures, the options are: data replication at the application layer, use of EFS (which is regional and multi-AZ), or migration to managed storage services that abstract replication.
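
The zonal dependency can be made visible by grouping attached volumes by AZ. This is a rough sketch, assuming volume records shaped like the `Volumes` entries from boto3's `describe_volumes` (`VolumeId`, `AvailabilityZone`, `Attachments`); the function name and sample IDs are hypothetical.

```python
def instances_pinned_to_az(volumes, failed_az):
    """Map each instance ID to the attached volumes that live in failed_az."""
    pinned = {}
    for volume in volumes:
        if volume["AvailabilityZone"] != failed_az:
            continue
        # Unattached volumes have an empty Attachments list and are skipped.
        for attachment in volume.get("Attachments", []):
            pinned.setdefault(attachment["InstanceId"], []).append(volume["VolumeId"])
    return pinned


if __name__ == "__main__":
    # Hypothetical inventory, for illustration only.
    inventory = [
        {"VolumeId": "vol-a", "AvailabilityZone": "us-east-1a",
         "Attachments": [{"InstanceId": "i-1"}]},
        {"VolumeId": "vol-b", "AvailabilityZone": "us-east-1a",
         "Attachments": [{"InstanceId": "i-1"}]},
        {"VolumeId": "vol-c", "AvailabilityZone": "us-east-1b",
         "Attachments": [{"InstanceId": "i-2"}]},
    ]
    print(instances_pinned_to_az(inventory, "us-east-1a"))
```

Every instance in the result loses its storage if that AZ goes dark, so each one needs either a replica elsewhere or an explicit decision that the data is disposable.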

**Graceful degradation**: Systems that cannot degrade gracefully turn partial failures into total failures. Circuit breakers, fallbacks to cached data, message queues to absorb spikes during recovery — these patterns make the difference between 'degraded but functional' and 'completely unavailable.'
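
As an illustration of the pattern, here is a minimal circuit breaker with a fallback, written as a sketch rather than a production implementation. The class name, thresholds, and the injectable `clock` are assumptions made for the example. It is not thread-safe, and it treats any exception from the primary call as a failure, which a real system would narrow.

```python
import time


class CircuitBreaker:
    """Serve a fallback while a dependency keeps failing, then retry later."""

    def __init__(self, failure_threshold=3, reset_after_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def _is_open(self):
        if self.opened_at is None:
            return False
        # After the reset window, let one trial call through (half-open).
        return self.clock() - self.opened_at < self.reset_after_seconds

    def call(self, primary, fallback):
        if self._is_open():
            return fallback()  # do not even try the failing dependency
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


if __name__ == "__main__":
    breaker = CircuitBreaker(failure_threshold=2, reset_after_seconds=5.0)

    def read_from_database():
        raise ConnectionError("primary database unreachable")

    def read_from_cache():
        return {"source": "cache", "stale": True}

    for _ in range(3):
        print(breaker.call(read_from_database, read_from_cache))
```

The design choice that matters here is that the fallback returns something usable, such as stale cached data, so the caller stays "degraded but functional" instead of propagating the error.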

**Multi-region as strategy, not luxury**: For workloads with availability SLOs above 99.9%, single-region — especially single-region in us-east-1 — is not a defensible strategy. Multi-region with Route 53 failover, Global Accelerator, or active-active with data replication is the correct path. The cost is real, but it needs to be compared against the cost of hours of production downtime.
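
The comparison becomes clearer with the error-budget arithmetic written out. The sketch below assumes a 365-day year and treats the SLO as a simple uptime percentage; the 4-hour outage is the hypothetical used in this article, not a measured duration for this incident.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(slo_percent, period_minutes=MINUTES_PER_YEAR):
    """Downtime budget implied by an availability SLO over the period."""
    return period_minutes * (1 - slo_percent / 100)


def budget_consumed(outage_minutes, slo_percent, period_minutes=MINUTES_PER_YEAR):
    """Fraction of the budget used by one outage (values above 1.0 mean the SLO is blown)."""
    return outage_minutes / allowed_downtime_minutes(slo_percent, period_minutes)


if __name__ == "__main__":
    outage = 4 * 60  # hypothetical 4-hour AZ or region outage
    for slo in (99.5, 99.9, 99.95, 99.99):
        budget = allowed_downtime_minutes(slo)
        used = budget_consumed(outage, slo)
        print(f"SLO {slo}%: budget {budget:.1f} min/year, 4h outage uses {used:.0%}")
```

Under these assumptions, a 99.9% SLO allows about 8.8 hours of downtime per year, so a single 4-hour outage consumes roughly 46% of the annual budget. At 99.99% the budget is about 53 minutes, and the same outage exceeds it more than fourfold.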

**Chaos engineering and game days**: No multi-AZ or multi-region architecture that has never been tested under real failure can be considered validated. AZ failure injection in staging environments, game days simulating region loss — these exercises reveal hidden dependencies that architecture documents do not capture.

## Incident Lessons

- **Cooling is availability**: HVAC and cooling systems are first-class availability dependencies. Software running on the affected hosts cannot mitigate a physical infrastructure failure; such failures define the lower bound of any cloud SLA.
- **us-east-1 is concentration risk**: The largest AWS region in workload density is also the one that amplifies the impact of any incident. Single-region in us-east-1 is not a defensible strategy for SLOs above 99.9%.
- **AZ isolation only works if the architecture respects AZ boundaries**: EBS volumes are zonal. EC2 instances are zonal. Implicit dependencies on a single AZ nullify the benefit of the multi-AZ model.
- **Graceful degradation is design, not optimization**: Systems without fallbacks, circuit breakers, and absorption queues turn partial failures into total unavailability. This is a design choice, not an accident.
- **Multi-AZ failover has latency — and that matters**: RDS Multi-AZ failover typically takes 60-120 seconds. For applications that cannot tolerate this window, the architecture needs additional resilience mechanisms at the application layer.
- **Chaos engineering validates what documents do not capture**: Hidden AZ dependencies only surface under real failure. Game days and fault injection in staging are reliability investments, not process overhead.
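
For the failover window mentioned in the lessons above, one application-layer mechanism is a retry loop with exponential backoff and jitter whose overall deadline is longer than the expected failover. The sketch below is illustrative: `call_with_retry`, the 180-second default deadline, and the set of retried exceptions are assumptions, and it only suits operations that are safe to repeat.

```python
import random
import time


def call_with_retry(operation, deadline_seconds=180.0, base_delay=0.5, max_delay=15.0,
                    retry_on=(ConnectionError, TimeoutError),
                    sleep=time.sleep, clock=time.monotonic):
    """Retry operation with jittered exponential backoff until the deadline.

    The default deadline is an assumption: it is chosen to exceed a
    60-120 second failover, and should be tuned per workload.
    """
    start = clock()
    attempt = 0
    while True:
        try:
            return operation()
        except retry_on:
            attempt += 1
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay = random.uniform(0, ceiling)  # jitter spreads out retry storms
            if clock() - start + delay > deadline_seconds:
                raise  # out of time: surface the last error to the caller
            sleep(delay)


if __name__ == "__main__":
    attempts = {"count": 0}

    def query_database():
        # Hypothetical operation that fails twice, as if a failover were in progress.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ConnectionError("database endpoint not ready")
        return "row data"

    print(call_with_retry(query_database, base_delay=0.1))
```

Retries of this kind pair naturally with a circuit breaker: the retry absorbs a short failover, while the breaker stops callers from piling onto a dependency that stays down.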

> **My Perspective: What I Would Do Differently:** I have worked with high-availability systems for over 16 years, including financial infrastructure where minutes of downtime have measurable cost in regulation and revenue. This incident does not surprise me; it only surprises those who never took seriously the phrase 'the cloud is someone else's computer.'
>
> My position is direct: any workload with an SLO above 99.5% running single-region in us-east-1 in 2025 has an architectural debt that needs to be addressed, not postponed. The cost of active-passive multi-region with Route 53 failover is a fraction of the cost of a 4-hour production outage for most digital businesses.
>
> But what bothers me most about this incident is not the AWS failure: physical datacenters fail, and AWS has a reasonable track record of containment. What bothers me is the recurring pattern of architectures that treat the cloud as if it were infinitely resilient by design. EBS being used as primary storage without replication in critical workloads. RDS without Multi-AZ in production. Auto Scaling Groups anchored to a single AZ for configuration convenience.
>
> If I were reviewing a customer's architecture after this incident, my first question would be: 'show me the AZ failure runbook.' If it does not exist, or if it exists but has never been tested, that is the most urgent work, ahead of cost optimization, the new feature, or the version upgrade. Resilience needs to be tested to be real. And us-east-1, given its historical density of workloads and incidents, should be treated with the same respect given to any single point of failure because, for those who have not diversified, that is exactly what it is.

## Verdict: Physics Does Not Abstract

The thermal event in us-east-1 is a reminder that every cloud architecture has a physical substrate — and that substrate can fail in ways no software layer can compensate for. AWS did what was correct: hosts shut down to protect hardware, and the team worked to restore service. The AZ model exists to contain exactly this type of physical blast radius.

The problem is not AWS. The problem is concentration. us-east-1 concentrates risk for historical and operational reasons that are understandable, but not an excuse. Architectures that do not genuinely distribute load across AZs, that do not test failover, that do not have graceful degradation — these architectures turn a containable event into total unavailability.

The lessons are old and well known: distribute across AZs, test failover, implement graceful degradation, consider multi-region for high SLOs, and never treat the cloud as if it were immune to physics. What changes with each incident like this is the cost of not having learned sooner. For teams that take resilience seriously, this incident is a validation exercise. For the rest, it is a warning that should be turned into concrete architectural action — before the next thermal event.

## References

- [Network World — AWS hit by US-East-1 outage after data center thermal event](https://www.networkworld.com/article/4168878/aws-hit-by-us-east-1-outage-after-data-center-thermal-event.html)
- [AWS Health Dashboard — Service Health History](https://health.aws.amazon.com/)
- [AWS Well-Architected Framework — Reliability Pillar](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html)
- [AWS Documentation — Amazon EBS Availability and Durability](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volumes.html)
- [AWS Documentation — Multi-AZ deployments for Amazon RDS](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.MultiAZ.html)
- [AWS Blog — Building a Multi-Region Active-Active Backend](https://aws.amazon.com/blogs/architecture/building-a-multi-region-active-active-backend/)

## Case sources

- [Network World — AWS hit by US-East-1 outage after data center thermal event](https://www.networkworld.com/article/4168878/aws-hit-by-us-east-1-outage-after-data-center-thermal-event.html)
- [AWS Health Dashboard](https://health.aws.amazon.com/)