# AWS S3 us-east-1 (2017): when a typo brings down the internet

On February 28, 2017, an AWS engineer executed a debug command with an incorrect parameter and removed far more servers from S3's indexing subsystem than intended. The result was four hours of severe degradation in us-east-1 that affected hundreds of dependent services — from monitoring tools to major SaaS platforms. The incident exposed structural weaknesses around implicit dependencies, the absence of rate limiting on destructive operations, and the fallacy of assuming a single AWS region would be sufficiently resilient.

- URL: https://fernando.moretes.com/studies/aws-s3-us-east-1-2017

- Markdown: https://fernando.moretes.com/studies/aws-s3-us-east-1-2017/study.md?lang=en

- Type: Post-mortem

- Company: AWS

- Domain: Resilience

- Date: 2017-02-28

- Tags: s3, aws, resilience, postmortem, us-east-1, blast-radius, operations, implicit-dependency

- Reading time: 10 min

---

A single misconfigured command, executed by an experienced engineer during a routine investigation, was enough to degrade Amazon S3 in us-east-1 for nearly four hours — and with it, a significant fraction of the internet infrastructure the western world uses daily. It was not an attack. It was not a hardware failure. It was a human error amplified by the absence of guardrails on destructive operations and by years of accumulated dependencies around a single service in a single region.

## Incident Fact Sheet

- **Company / Service:** Amazon Web Services — Amazon S3
- **Date:** February 28, 2017
- **Total degradation duration:** ~4 hours (start ~09:37 PST, full recovery ~13:54 PST)
- **Affected region:** us-east-1 (US Standard)
- **Affected S3 subsystems:** Index (object metadata and location) and Placement (storage allocation for new objects); both lost a large share of their capacity
- **Root cause:** An incorrect parameter in a debug command removed far more servers than intended from S3's Index subsystem
- **Direct impact:** S3 GET, LIST, and DELETE with high error rates; PUT largely unavailable for most of the period
- **Cascading impact (public examples):** AWS Console, CloudFormation, Lambda, ECS, Elastic Beanstalk, Kinesis, SNS, SQS, RDS, EC2 (images/AMIs), Slack, GitHub, Quora, IFTTT, Trello, Medium, Docker Hub, and hundreds of others
- **Relevant technical stack:** Amazon S3 (internal subsystems: Index, Placement, Storage nodes); internal capacity management tooling
- **Notable irony:** The AWS Service Health Dashboard itself depended on S3. Per AWS's post-mortem, its administration console could not update individual service status until about 11:37 PST, roughly two hours into the incident

## What happened

On the morning of February 28, 2017, the S3 engineering team was investigating why the S3 billing system in us-east-1 was progressing more slowly than expected. As part of that diagnosis, an engineer following an established playbook ran an internal capacity management tool to remove a small number of servers from one of the S3 subsystems used by the billing process.

The problem was simple and devastating: one of the inputs to the command was entered incorrectly, and the tool removed a far larger set of servers than intended. According to AWS's post-mortem, the servers removed supported two other subsystems: the **Index** subsystem, which maintains the metadata and location of every S3 object in the region, and the **Placement** subsystem, which allocates storage for new objects and depends on Index to function. The tool had no rate limiting or safety validation that would prevent removing more than a safe share of operating capacity in one operation.
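The missing validation can be illustrated with a minimal sketch. It assumes a hypothetical policy with two limits (a maximum fraction of the fleet per operation and a floor on active servers); the names `RemovalPolicy` and `validate_removal` and the default values are illustrative and are not taken from AWS's tooling.

```python
from dataclasses import dataclass


class UnsafeRemovalError(Exception):
    """Raised when a removal request would breach a safety limit."""


@dataclass(frozen=True)
class RemovalPolicy:
    # Hypothetical limits; real values would come from capacity planning.
    max_fraction_per_operation: float = 0.05
    min_active_servers: int = 100


def validate_removal(active: int, requested: int, policy: RemovalPolicy) -> int:
    """Return `requested` if the removal is within policy, otherwise raise."""
    if requested <= 0:
        raise ValueError("requested must be a positive number of servers")
    if requested > active:
        raise UnsafeRemovalError(f"cannot remove {requested} of {active} servers")
    # Limit 1: cap the share of the fleet a single operation may touch.
    if requested > active * policy.max_fraction_per_operation:
        raise UnsafeRemovalError(
            f"{requested} exceeds {policy.max_fraction_per_operation:.0%} of {active} servers"
        )
    # Limit 2: never drop below the capacity floor, however small the request.
    if active - requested < policy.min_active_servers:
        raise UnsafeRemovalError(
            f"removal would leave {active - requested} servers, "
            f"below the floor of {policy.min_active_servers}"
        )
    return requested
```

With these assumed defaults, a request to remove 10 of 1,000 servers passes, while a mistyped request for 400 is rejected before any server is touched.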

S3's Index subsystem is, by nature, the central coordination point for all service operations. Without sufficient Index capacity, S3 cannot resolve where objects are stored (affecting GET and LIST), nor determine where new objects should be written (affecting PUT). The effect was immediate: error rates spiked across all S3 operations in us-east-1.

Recovery was slower than expected for an important technical reason: the Index subsystem, when restarted at full capacity, needed to execute a **consistency verification and state repair process** before it could accept production traffic. This process — necessary to guarantee data durability and consistency — took significantly longer than AWS had planned in its recovery runbooks, because the Index subsystem had not been fully restarted at scale for a very long time. Restart time had grown alongside S3's data volume, but recovery procedures had not been reviewed with the same frequency.

## Timeline

1. **~09:37 PST — Command executed** — Engineer executes internal capacity management tool with incorrect parameter. Excess servers are removed from S3's Index subsystem in us-east-1. Impact is nearly immediate.

2. **~09:40 PST — First alerts** — S3 error rates spike. AWS internal alerts fire. Services depending on S3 begin reporting failures — including AWS's own internal monitoring tools.

3. **~09:45–10:00 PST — Externally visible cascade** — AWS Console, CloudFormation, Lambda, ECS, and dozens of other AWS services begin partially failing. External customers start reporting failures en masse. The Service Health Dashboard is slow to update — its own assets are in S3.

4. **~10:00–11:00 PST — Diagnosis and restart decision** — The team identifies the root cause. The decision is made to restore full Index subsystem capacity. The restart process is initiated, but requires state consistency verification before accepting traffic — a process slower than runbooks anticipated.

5. **~11:00–12:30 PST — Partial Index recovery** — The Index subsystem begins gradually recovering capacity. Some S3 operations start functioning intermittently. The Placement subsystem, also affected, begins its own recovery process.

6. **~12:30–13:54 PST — Full recovery** — Full Index and Placement capacity is restored. Error rates return to normal. AWS declares the incident resolved at ~13:54 PST. Total duration: approximately 4 hours and 17 minutes.

## Failure Flow: S3 us-east-1

The diagram reconstructs the cascading failure flow from the improper removal of S3's Index subsystem capacity, showing how degradation propagated to AWS internal services and then to external customers.

### 🛠️ Debug Operation (Root Cause)

- AWS engineer (debug session) (user)
- Capacity Mgmt Tool (internal tool) (compute)

### 🗄️ S3 Internals — us-east-1

- S3 Index Subsystem (metadata / placement coord.) (data)
- S3 Placement Subsystem (where to write new objects) (data)
- S3 Storage Nodes (data at rest, intact) (storage)
- S3 API Layer (GET / PUT / LIST / DELETE) (frontend)

### ☁️ Dependent AWS Services

- AWS Console (assets in S3) (frontend)
- CloudFormation (templates in S3) (compute)
- Lambda (function code in S3) (compute)
- ECS / Elastic Beanstalk (images / configs in S3) (compute)
- Service Health Dashboard (assets in S3, ironically) (frontend)
- Kinesis, SNS, SQS, RDS... (indirect dependencies) (messaging)

### 🌐 External Customers

- Slack, GitHub, Trello, Medium, Quora, IFTTT... (external)
- End users (perceived impact) (user)

### Flows

- engineer -> mgmt-tool: executes with wrong param
- mgmt-tool -> s3-index: removes excess servers ⚠️
- s3-index -> s3-placement: no placement coordination
- s3-index -> s3-api: GET/LIST fail (no metadata)
- s3-placement -> s3-api: PUT fails (no destination)
- s3-api -> aws-console: 5xx errors
- s3-api -> cloudformation
- s3-api -> lambda
- s3-api -> ecs
- s3-api -> health-dashboard: dashboard blind 😬
- s3-api -> other-aws
- aws-console -> saas-clients
- other-aws -> saas-clients
- saas-clients -> end-users: user-visible failures
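
The flow list above is effectively a directed graph, so the blast radius of any node can be computed with a breadth-first traversal. The sketch below copies the edges exactly as listed; the function name `blast_radius` is an illustrative choice, and the graph is only as complete as this simplified diagram.

```python
from collections import deque

# Edges copied from the flow list: (a, b) means a failure in a reaches b.
FLOWS = [
    ("engineer", "mgmt-tool"),
    ("mgmt-tool", "s3-index"),
    ("s3-index", "s3-placement"),
    ("s3-index", "s3-api"),
    ("s3-placement", "s3-api"),
    ("s3-api", "aws-console"),
    ("s3-api", "cloudformation"),
    ("s3-api", "lambda"),
    ("s3-api", "ecs"),
    ("s3-api", "health-dashboard"),
    ("s3-api", "other-aws"),
    ("aws-console", "saas-clients"),
    ("other-aws", "saas-clients"),
    ("saas-clients", "end-users"),
]


def blast_radius(failed, flows=FLOWS):
    """Return every node reachable downstream of `failed`."""
    downstream = {}
    for src, dst in flows:
        downstream.setdefault(src, []).append(dst)

    reached = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for nxt in downstream.get(node, []):
            if nxt not in reached:
                reached.add(nxt)
                queue.append(nxt)
    return reached


if __name__ == "__main__":
    print(sorted(blast_radius("s3-index")))
```

In this diagram, a failure in `s3-index` reaches all ten nodes downstream of it, including `health-dashboard`, which is the node meant to report on the failure.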

> **Root Cause:** An engineer executed an internal capacity management tool with an incorrect numeric parameter, removing a substantial fraction of S3's Index subsystem servers in us-east-1 in a single operation. The tool had no protection mechanism against removing capacity above a safe threshold (e.g., maximum X% of the fleet in a single operation). The Index subsystem, without sufficient capacity, became unable to coordinate S3 read and write operations. Recovery was prolonged because the Index subsystem restart process — which requires state consistency verification before accepting traffic — took far longer than runbooks anticipated, as restart time had grown proportionally with S3's data volume without operational procedures being updated accordingly.

## The Anatomy of the Cascade

What made this incident particularly instructive was not the root cause itself — a human error in an administrative operation is mundane — but the **amplitude of the blast radius** and the structural reasons behind it.

**Why did so many AWS services fail?** S3 in us-east-1 had become, over years, a de facto infrastructure dependency for an enormous number of AWS internal services. Lambda stored function code packages in S3. CloudFormation stored and read templates from S3. ECS and Elastic Beanstalk depended on S3 for configurations and images. The AWS Console itself used S3 to serve static assets. Many of these dependencies had been introduced incrementally, without systematic analysis of how an S3 failure would propagate through the ecosystem. The result was a graph of implicit dependencies that nobody had fully mapped until the central node failed.

**The Service Health Dashboard irony** deserves special attention. The primary mechanism by which AWS communicates service status to customers depended on the very service that was failing to render its interface. During the first hours of the incident, the dashboard was degraded or stale — exactly when customers most needed reliable information. This is not just operational irony; it is a classic architectural anti-pattern: **the control plane must not depend on the data plane it is monitoring**.

**Why did recovery take so long?** AWS was transparent on this point in its public post-mortem: the Index subsystem restart process had grown in duration over time, proportional to the growth in data volume managed by S3. The consistency verification process — essential to ensure no data was corrupted or lost — simply took longer in 2017 than it did when the runbooks were written. Nobody had tested a full subsystem restart in production recently enough to notice the drift. This is a clear example of **configuration drift in operational procedures**: systems evolve, but recovery procedures fall behind.

## Remediation and AWS Actions

AWS published a detailed public post-mortem and committed to a set of concrete corrective actions. It is worth analyzing them critically.

**1. Rate limiting and guardrails in capacity management tools.** The most direct action: AWS modified the tool to remove capacity more slowly and added safeguards that block any removal that would take a subsystem below its minimum required capacity. This is the obvious and necessary fix, but it also raises the question of why this protection did not already exist in a service at S3's scale. Operational tools that can cause catastrophic impact need built-in circuit breakers from the start, not after an incident.
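
Removing capacity "more slowly" can be sketched as batched removal with a health check before each batch. This is an assumption about the general shape of such a safeguard, not a description of AWS's tool; `remove` and `is_healthy` are hypothetical callbacks supplied by the caller, and a real tool would also wait between batches so metrics can settle.

```python
from typing import Callable, List


def remove_in_batches(
    servers: List[str],
    batch_size: int,
    remove: Callable[[str], None],
    is_healthy: Callable[[], bool],
) -> List[str]:
    """Remove servers one batch at a time, halting when the health check fails.

    Returns the servers that were actually removed.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    removed: List[str] = []
    for start in range(0, len(servers), batch_size):
        # Check health before every batch, including the first one.
        if not is_healthy():
            break
        for server in servers[start:start + batch_size]:
            remove(server)
            removed.append(server)
    return removed
```

The design choice is that a wrong input can cost at most one batch before the health check gets a chance to stop the operation.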

**2. Review and update of recovery runbooks.** AWS committed to reviewing restart procedures for all critical subsystems, accounting for current restart time — not historical time. More importantly: it committed to testing these procedures regularly. This is Game Day / Chaos Engineering applied to operational procedures: you only know how long a restart takes if you execute it periodically under controlled conditions.
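
One way to catch this kind of drift is to record the duration measured in each drill and compare it with the runbook's estimate. The sketch below assumes a hypothetical `DrillResult` record and a tolerance of 25%; both are illustrative choices.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class DrillResult:
    procedure: str
    runbook_estimate_min: float  # duration the runbook promises
    measured_min: float          # duration observed in the latest drill


def drift_ratio(result: DrillResult) -> float:
    """Measured duration divided by the runbook estimate (1.0 means no drift)."""
    if result.runbook_estimate_min <= 0:
        raise ValueError("runbook estimate must be positive")
    return result.measured_min / result.runbook_estimate_min


def stale_runbooks(results: Iterable[DrillResult], tolerance: float = 1.25) -> List[str]:
    """Return procedures whose measured duration exceeds the estimate by more than `tolerance`."""
    return [r.procedure for r in results if drift_ratio(r) > tolerance]


if __name__ == "__main__":
    # Illustrative numbers only; these are not measurements from the incident.
    drills = [
        DrillResult("index-restart", runbook_estimate_min=30, measured_min=120),
        DrillResult("cache-flush", runbook_estimate_min=10, measured_min=11),
    ]
    print(stale_runbooks(drills))
```

A check like this only works if the drills are actually run; the comparison is trivial, the discipline of producing `measured_min` is not.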

**3. Reduction of Service Health Dashboard dependencies on S3.** AWS explicitly acknowledged the dashboard problem and committed to making it more resilient to S3 failures. The control plane needs an alternative data plane — or, ideally, should be architected to operate completely independently of the service it monitors.

**4. Audit of internal S3 dependencies.** Implicit in the post-mortem is the acknowledgment that internal dependency mapping was insufficient. Remediation here is harder: it requires a systematic audit of which services depend on S3 for critical control functions (not just data), and the introduction of fallbacks or caches to reduce the impact of a future degradation.
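
A cache that serves stale data when the upstream is down is one simple form of the fallback mentioned above. The sketch assumes a hypothetical `fetch` callable standing in for an S3 read; the class name and TTL are illustrative, and whether stale data is acceptable depends entirely on the workload.

```python
import time
from typing import Callable, Dict, Tuple


class StaleOnErrorCache:
    """Read-through cache that falls back to an expired entry if the fetch fails."""

    def __init__(
        self,
        fetch: Callable[[str], bytes],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def get(self, key: str) -> bytes:
        entry = self._entries.get(key)
        now = self._clock()
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # fresh hit, no upstream call
        try:
            value = self._fetch(key)
        except Exception:
            if entry is not None:
                return entry[1]  # upstream down: serve the stale copy
            raise  # nothing cached: surface the failure to the caller
        self._entries[key] = (now, value)
        return value
```

A deployment tool that reads templates through a wrapper like this keeps working on previously seen templates during an outage, and fails only on keys it has never fetched.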

What AWS did **not** say publicly — but which is an obvious consequence for architects who read the incident — is that designing a global service with a single centralized coordination point (the Index subsystem) in a single region creates systemic risk that no operational guardrail fully eliminates. The long-term architectural response is distribution and blast radius isolation by design, not just by procedure.

## Technical Lessons

- **Destructive operations require circuit breakers by design.** Any tool that can reduce the capacity of a critical subsystem must have a maximum impact limit per operation — configurable, auditable, and not bypassable without explicit approval. This is not bureaucracy; it is safety engineering.
- **Implicit dependencies are resilience technical debt.** S3's dependency graph in us-east-1 had grown undocumented for years. Control-plane services (console, health dashboard, deployment tools) must not depend on the same data plane they serve — or must have explicit, tested fallbacks.
- **The control plane cannot depend on the data plane it monitors.** The Service Health Dashboard being degraded during the incident is the most visible example, but the principle applies to any observability, alerting, and status communication system. These systems need explicit blast radius isolation.
- **Runbooks have an expiration date.** Recovery procedures written for a system at a certain scale become inaccurate — sometimes dangerously so — as the system grows. Testing runbooks in production periodically (Game Days, chaos drills) is the only way to keep estimated recovery time aligned with reality.
- **Multi-region is not just for disaster recovery.** Many affected customers had single-region architectures in us-east-1. A 4-hour incident in a single region is tolerable if you have failover to another. For critical workloads, active-active or active-passive multi-region with automated failover is not over-engineering — it is the cost of the SLA.
- **Human errors are inevitable; resilient systems contain them.** Blameless analysis does not mean ignoring that a human made an error — it means recognizing that well-designed systems do not allow individual errors to cause catastrophic impact. The error was the trigger; the absence of guardrails was the systemic cause.
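
The multi-region lesson can be sketched as a read that tries an ordered list of regions. The fetch callables here are hypothetical stand-ins for per-region clients, and the sketch assumes the data is already replicated to the secondary region, which is the hard part in practice.

```python
from typing import Callable, List, Sequence, Tuple


class AllRegionsFailed(Exception):
    """Raised when no configured region could serve the read."""


def read_with_failover(
    key: str,
    regions: Sequence[Tuple[str, Callable[[str], bytes]]],
) -> Tuple[str, bytes]:
    """Try each (region_name, fetch) pair in order; return the first success."""
    errors: List[str] = []
    for name, fetch in regions:
        try:
            return name, fetch(key)
        except Exception as exc:
            # Record the failure and move on to the next region.
            errors.append(f"{name}: {exc}")
    raise AllRegionsFailed("; ".join(errors))


if __name__ == "__main__":
    def primary(key: str) -> bytes:
        raise ConnectionError("503 from primary")

    def secondary(key: str) -> bytes:
        return b"replicated copy of " + key.encode()

    print(read_with_failover("config.json", [("us-east-1", primary), ("us-west-2", secondary)]))
```

Returning the region name alongside the data lets callers log or alert when reads are being served from the fallback.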

> **Senior Architect Perspective:** I have worked with mission-critical financial systems for over 16 years, and what strikes me about this incident is not the error itself — it is what the error revealed about the state of AWS's internal dependencies in 2017.

S3 had quietly become a single point of failure for an enormous fraction of internet infrastructure. Not through bad intent, but through incremental accumulation of individually reasonable decisions. Each team that decided to store their assets in S3 was making the locally correct choice. The problem is that nobody was looking at the global dependency graph and asking: 'what happens if S3 is unavailable for 4 hours?'

In financial systems, we call this **risk concentration**. Regulators require you to map, limit, and test your exposure to single points of failure — not just in hardware, but in vendors, services, and shared infrastructure. The cloud industry is still learning to apply that rigor.

If I were redesigning the architecture of any critical service affected by this incident, I would apply three principles immediately: **1) Blast radius budgets** — explicitly define the maximum acceptable impact of a failure of any external dependency, and architect to not exceed it. **2) Control plane isolation** — observability tools, health dashboards, and incident communication systems must be deployed on infrastructure completely separate from the service they monitor. **3) Dependency graph audits** — periodically map and review all runtime dependencies, especially implicit ones.
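
A blast radius budget becomes concrete when expressed as a downtime budget. The sketch below is plain arithmetic: it converts an availability target into allowed minutes of downtime and compares that with the 4 hours and 17 minutes from the timeline. The availability targets are illustrative and are not S3's published SLA.

```python
def allowed_downtime_minutes(availability: float, period_days: int) -> float:
    """Minutes of downtime permitted by an availability target over a period."""
    if not 0.0 < availability < 1.0:
        raise ValueError("availability must be strictly between 0 and 1")
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    return (1.0 - availability) * period_days * 24 * 60


def budget_consumed(outage_minutes: float, availability: float, period_days: int) -> float:
    """Fraction of the downtime budget used by a single outage (above 1.0 means breached)."""
    return outage_minutes / allowed_downtime_minutes(availability, period_days)


# 09:37 to 13:54 PST, per the timeline above.
OUTAGE_MINUTES = 4 * 60 + 17

if __name__ == "__main__":
    # Illustrative targets: 99.9% over a 30-day month, 99.95% over a year.
    print(round(allowed_downtime_minutes(0.999, 30), 1))
    print(round(budget_consumed(OUTAGE_MINUTES, 0.999, 30), 2))
    print(round(budget_consumed(OUTAGE_MINUTES, 0.9995, 365), 2))
```

Under a 99.9% monthly target the budget is about 43 minutes, so an application with a hard dependency on the affected region would have overspent it several times over in a single morning.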

## Verdict

The S3 incident of February 2017 is a near-perfect case study in how complex systems fail: not through a single catastrophic cause, but through the intersection of a mundane human error with absent guardrails, accumulated implicit dependencies, and operational procedures that did not keep pace with system growth.

The most important lesson is not technical — it is organizational. AWS had built, over years, an ecosystem of interdependent services without a systematic mechanism to map and limit risk concentration around critical components. When the central node failed, the amplitude of the impact surprised even AWS itself.

For architects and engineers who read this document: **the question you should be asking about your architecture today is not 'what happens if my application fails?', but 'what happens if any of my external dependencies is unavailable for 4 hours?'**. If the answer is 'my application is also unavailable', you have work to do — regardless of which cloud provider you use.

AWS recovered, learned, and published one of the most honest public post-mortems I have seen from a large-scale cloud provider.

## References

- [AWS — Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region](https://aws.amazon.com/message/41926/)
