AWS Kinesis us-east-1 (2020): When an OS Thread Limit Took Down Half of AWS

Nov 25, 2020 10 min AI-assisted

In November 2020, a capacity addition to the Amazon Kinesis front-end fleet pushed the front-end servers past an operating system thread limit, triggering cascading failures that impacted Cognito, CloudWatch, Lambda, ECS, and dozens of other services in us-east-1 for most of the day. Kinesis itself took roughly 17 hours to return fully to normal. The incident exposed hidden inter-service dependencies within AWS and the risks of seemingly safe operational changes in large-scale systems.

Incident Facts

Company / System: Amazon Web Services, Amazon Kinesis Data Streams
Incident Date: November 25, 2020
Total Duration: Approximately 17 hours for Kinesis (first alarms ~13:15 UTC / 5:15 AM PST; full recovery ~06:23 UTC on November 26 / 10:23 PM PST), with dependent services recovering at different times
Affected Region: us-east-1 (Northern Virginia), the largest AWS region
Origin Service: Amazon Kinesis Data Streams (front-end servers)
Impacted Services: Cognito, CloudWatch, Lambda, ECS, CloudFormation, EventBridge, Auto Scaling, AWS Console, and 20+ other services
Root Cause: OS thread limit exceeded on Kinesis front-end servers after a capacity addition
Technical Stack: Front-end servers (assumed JVM-based in this study; AWS's summary does not name the runtime), Kinesis shards, internal AWS services that use Kinesis for metrics, logs, and events
Customer Impact: Authentication failures (Cognito), missing metrics and alarms (CloudWatch), deployment and scaling failures, on the day before Thanksgiving and two days before Black Friday

A single operating system configuration limit — the maximum number of threads per process — turned a routine capacity expansion into one of the most wide-ranging incidents in AWS's public history. What appeared to be a safe operational adjustment revealed a web of internal dependencies that no one had fully mapped, and brought down services that, on the surface, had nothing to do with data streaming.

What Happened

On the morning of November 25, 2020 — Thanksgiving Eve in the US and the start of Black Friday week — the Amazon Kinesis team executed a planned capacity expansion in the us-east-1 region. The goal was to increase the number of Kinesis front-end servers to support the elevated load expected during the peak commercial period. This is a routine operation in any large-scale distributed system: add nodes to distribute load.

The problem lay in an implementation detail of the Kinesis front-end servers. Each front-end server maintains connections to all other nodes in the cluster — an all-to-all topology model common in systems that need to route requests to specific shards without an additional indirection layer. As the number of front-end servers grew, each server needed to open and maintain more simultaneous connections. Each connection, in turn, used a dedicated thread for I/O management.
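The growth described above can be made concrete with a little arithmetic. The sketch below assumes one dedicated thread per peer, as the paragraph describes. The function names, the 4096-thread limit, and the 200-thread baseline are illustrative assumptions, not values published by AWS.

```python
def mesh_threads_per_server(fleet_size: int) -> int:
    """Threads one server needs when it dedicates one thread to every peer."""
    if fleet_size < 1:
        raise ValueError("fleet_size must be at least 1")
    return fleet_size - 1


def mesh_connections_total(fleet_size: int) -> int:
    """Directed connections across the whole fleet: n * (n - 1)."""
    return fleet_size * mesh_threads_per_server(fleet_size)


def max_fleet_size(thread_limit: int, baseline_threads: int) -> int:
    """Largest n for which baseline_threads + (n - 1) stays within the limit."""
    return max(thread_limit - baseline_threads + 1, 0)


if __name__ == "__main__":
    # Hypothetical fleet sizes, chosen only to show the trend.
    for n in (10, 100, 1000):
        print(n, mesh_threads_per_server(n), mesh_connections_total(n))
    # Hypothetical limit and baseline: the fleet size at which the mesh hits the ceiling.
    print(max_fleet_size(thread_limit=4096, baseline_threads=200))
```

Per-server thread demand grows linearly with fleet size, so the ceiling is reached at a predictable fleet size. The total connection count across the fleet is what grows quadratically.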

Linux caps thread creation through several settings: the nproc resource limit (ulimit -u, which counts threads per user, not per process), the system-wide kernel.threads-max, and kernel.pid_max. AWS's summary says only that the servers exceeded the maximum number of threads allowed by an operating system configuration and does not name the setting. When the fleet grew past a certain size, the processes on existing servers tried to create threads for the new peers and failed. On a JVM, this kind of failure surfaces as java.lang.OutOfMemoryError: unable to create new native thread. From that point, Kinesis front-end servers could not complete the routing information they needed and began returning errors.
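To see which thread ceilings apply on a given Linux host, a process can read them directly. This is a minimal sketch, not AWS tooling: it reads the nproc resource limit (the value `ulimit -u` reports) plus two kernel-wide settings under /proc. The helper names are hypothetical, and on systems without /proc the kernel values simply come back as None.

```python
import threading

try:
    import resource  # available on Unix-like systems only
except ImportError:
    resource = None


def read_int(path):
    """Return the integer stored in a /proc file, or None if unavailable."""
    try:
        with open(path) as handle:
            return int(handle.read().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def thread_limits():
    """Collect the settings that can cap thread creation on Linux."""
    nproc_soft = None
    rlimit_nproc = getattr(resource, "RLIMIT_NPROC", None) if resource else None
    if rlimit_nproc is not None:
        soft, _hard = resource.getrlimit(rlimit_nproc)
        # None here means "unlimited" (or not available on this platform).
        nproc_soft = None if soft == resource.RLIM_INFINITY else soft
    return {
        "nproc_soft": nproc_soft,
        "threads_max": read_int("/proc/sys/kernel/threads-max"),
        "pid_max": read_int("/proc/sys/kernel/pid_max"),
        # Counts only threads started through Python's threading module.
        "python_threads": threading.active_count(),
    }


if __name__ == "__main__":
    for name, value in thread_limits().items():
        print(f"{name}: {value}")
```

The effective ceiling is whichever of these is reached first; container runtimes may add a further cgroup limit that this sketch does not read.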

Up to this point, the incident would have been contained to Kinesis. What turned it into a regional disaster was something most engineers outside AWS did not know: many AWS services use Kinesis as internal infrastructure. According to AWS's summary, Cognito uses Kinesis to collect and analyze API access patterns, and CloudWatch uses it to process metric and log data; Lambda, ECS, EventBridge, and Auto Scaling in turn depend on CloudWatch or on each other. When Kinesis became degraded, those services began buffering data they could not send, and mechanisms built on CloudWatch metrics and alarms, such as automatic scaling and recovery, also began to fail.

Incident Timeline

  1. ~10:44 UTC (2:44 AM PST): Capacity Addition Begins

    The Kinesis team begins adding front-end servers in us-east-1. The addition completes about an hour later, at ~11:47 UTC.

  2. ~13:15 UTC (5:15 AM PST): First Kinesis Alarms

    Alarms fire for errors on putting and getting Kinesis records. Front-end servers have exceeded the OS thread limit and can no longer create the threads they need to communicate with their peers.

  3. From ~13:15 UTC: Degradation Spreads to Cognito and CloudWatch

    Services that internally depend on Kinesis begin showing errors. Cognito starts failing authentications. CloudWatch error rates and latencies rise, and alarms fall into the INSUFFICIENT_DATA state, so alarms that teams rely on go quiet.

  4. ~14:15 UTC (6:15 AM PST) onward: Lambda, ECS, and Other Services Affected

    The cascade widens. Lambda shows invocation failures. ECS fails on task management operations. CloudFormation and EventBridge are impacted. The AWS Console becomes partially unavailable, hampering manual diagnosis.

  5. ~17:39 UTC (9:39 AM PST): Root Cause Confirmed

    AWS teams confirm that the added capacity pushed the front-end servers past the OS thread limit. Rather than raise the limit without testing, and with the added capacity already removed, the team decides to restart the front-end fleet.

  6. ~18:07 UTC (10:07 AM PST) onward: Gradual Recovery

    The first restarted servers begin accepting traffic. The fleet is brought back slowly, and dependent services recover as Kinesis error rates fall. Recovery is non-uniform: some services recover quickly, others take hours due to accumulated queues and inconsistent states.

  7. ~06:23 UTC, November 26 (10:23 PM PST): Kinesis Fully Recovered

    Kinesis returns to normal operation, roughly 17 hours after the first alarms. AWS publishes updates on the Service Health Dashboard throughout the day, though with a lag relative to the actual severity of the incident.

Failure Flow: From Thread Limit to Cascading Collapse

The diagram reconstructs how thread exhaustion on Kinesis front-end servers propagated to seemingly unrelated services through internal telemetry and authentication dependencies.

🔧 Scaling Operation (Trigger)
  • Capacity Expansion · New front-ends added
📡 Kinesis Front-End Layer
  • Front-End Servers (existing) · All-to-all mesh
  • Front-End Servers (new) · Added during scaling
  • OS Thread Limit · ulimit -u exhausted · OOME: native thread
🗄️ Kinesis Data Plane
  • Kinesis Shards · Backend storage
📊 Dependent Services (Internal Telemetry)
  • CloudWatch · Metrics and alarms stop working
  • Amazon Cognito · Authentication fails for end users
  • AWS Lambda · Invocation and cold-start failures
  • Amazon ECS · Task management failures
🔄 Affected Control Services
  • CloudFormation · Deployments fail
  • EventBridge · Event delivery degraded
  • AWS Console · Partially unavailable
👤 Final Impact
  • AWS Customers · Authentication, deployments, scaling, observability

Root Cause: The Limit Nobody Was Monitoring

The root cause is technically simple and operationally devastating: each Kinesis front-end server used one thread per peer to maintain the all-to-all mesh between cluster nodes. When new servers were added, each existing server tried to open new connections, and therefore create new threads, until it hit an operating system thread limit. AWS's summary does not name the setting. On Linux the usual candidates are the nproc limit (ulimit -u), which counts threads per user rather than per process, the system-wide kernel.threads-max, and kernel.pid_max; depending on distribution and configuration, the effective ceiling can range from a few thousand to tens of thousands of threads. On a JVM, a failed native thread creation throws OutOfMemoryError: unable to create new native thread. This is not a heap error; it is an OS resource error. The server was not out of RAM; it had run out of threads the kernel would allow. The limit was not being actively monitored, and no safeguard blocked the addition of new nodes when servers were near the critical threshold.
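The missing safeguard described above could take the form of a pre-change check that projects per-server thread demand before any servers are added. The following is a sketch under stated assumptions: one thread per peer, a known per-server baseline, and a 70% utilization budget. `ScalePlan` and every number in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalePlan:
    current_fleet: int            # servers in the mesh today
    added_servers: int            # servers the change would add
    baseline_threads: int         # threads per server unrelated to the mesh
    thread_limit: int             # effective OS thread ceiling per server
    max_utilization: float = 0.7  # refuse changes that pass 70% of the limit


def projected_threads(plan: ScalePlan) -> int:
    """Threads each server needs after the change: baseline plus one per peer."""
    new_fleet = plan.current_fleet + plan.added_servers
    return plan.baseline_threads + (new_fleet - 1)


def is_safe(plan: ScalePlan) -> bool:
    """True when projected thread demand stays inside the utilization budget."""
    return projected_threads(plan) <= plan.thread_limit * plan.max_utilization


if __name__ == "__main__":
    plan = ScalePlan(current_fleet=3000, added_servers=200,
                     baseline_threads=300, thread_limit=4096)
    print(projected_threads(plan), is_safe(plan))
```

A deployment pipeline could run a check like this as a gate, failing the change before the first new server joins the mesh.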

The Cascade Effect: Dependencies Nobody Documented

The most instructive aspect of this incident is not the bug itself — it is what it revealed about AWS's internal architecture. Kinesis is not just a streaming service for external customers; it is a critical internal infrastructure component used by other AWS services to transport telemetry, logs, and control events. This is an architectural decision that makes sense: using your own product as internal infrastructure (dogfooding) validates the service under real production load and reduces the need to maintain parallel observability stacks.

The problem is that this dependency created a structural coupling that was not publicly documented. When Kinesis failed, CloudWatch — which depended on Kinesis to ingest metrics — stopped receiving data. Without incoming metrics, alarms that should have fired automatically went silent. This created a situation of alarm blindness: the very tools that should have detected and alerted on the incident were compromised by the incident. On-call teams lost visibility precisely when they needed it most.

Cognito, in turn, used Kinesis to collect and analyze API access patterns. That data was meant to be best effort, buffered locally so that Cognito could ride out Kinesis latency or brief unavailability. When Kinesis stayed degraded, a latent bug in the buffering code caused Cognito web servers to block on the backlogged buffers, and authentication calls began failing: not because Cognito's core logic was broken, but because a blocked telemetry path stalled the request path. This directly impacted end users who depended on Cognito for authentication in their applications. It also hampered AWS's own communication, because the tool used to post updates to the Service Health Dashboard depended on Cognito. Engineers trying to diagnose the problem through the AWS Console ran into errors as well.
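One common way to keep a telemetry path from stalling the request path is a bounded buffer that drops records when full instead of blocking the caller. The sketch below illustrates that general pattern; it is not Cognito's code, and the class name, buffer size, and `send` callback are assumptions made for the example.

```python
import queue


class BestEffortTelemetry:
    """Bounded buffer that drops records instead of blocking the caller."""

    def __init__(self, max_buffered: int = 1000):
        self._buffer = queue.Queue(maxsize=max_buffered)
        self.dropped = 0  # simple counter; not synchronized across threads in this sketch

    def record(self, event: dict) -> bool:
        """Enqueue an event without waiting; return False if it was dropped."""
        try:
            self._buffer.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, send, batch_size: int = 100) -> int:
        """Hand up to batch_size events to send(); stop at the first failure."""
        sent = 0
        while sent < batch_size:
            try:
                event = self._buffer.get_nowait()
            except queue.Empty:
                break
            try:
                send(event)
            except Exception:
                # Backend unhealthy: this event is lost, but no caller is stalled.
                self.dropped += 1
                break
            sent += 1
        return sent
```

Request handlers call `record()` and move on; a background worker calls `drain()` with whatever client sends to the telemetry backend. Losing telemetry during an outage is the deliberate trade-off.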

This pattern — where the observability system depends on the system that is failing — is a classic resilience anti-pattern. In financial systems where I've worked, we call this observer coupling: when the observer and the observed share the same failure plane, you lose diagnostic capability exactly when you need it most. The architectural solution is to ensure that the control and observability plane is completely independent of the data plane being monitored.

Remediation: What AWS Did and What They Promised to Change

The immediate remediation had two parts: remove the capacity that had been added, bringing the fleet back below the size at which servers exceeded the thread limit, and then restart the front-end servers. The restart had to be gradual, because each server rebuilds its routing state on startup and bringing the whole fleet back at once would have overloaded that process; recovery therefore stretched over many hours. As front-end servers returned to normal operation, dependent services began recovering, each on its own schedule.

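AWS published a detailed post-mortem (Summary of the Amazon Kinesis Event in the Northern Virginia (US-EAST-1) Region) that identified several corrective actions. In the short term, the team moved the front-end fleet to larger servers, so that fewer servers, and therefore fewer threads per server, were needed for the same capacity; added fine-grained alarming on thread consumption; and began testing higher OS thread limits. Longer term, AWS described moving the front-end cache to a dedicated fleet and accelerating cellularization of the front-end fleet. Beyond AWS's list, the general architectural remedy for a thread-per-connection design is asynchronous, non-blocking I/O, which eliminates the linear dependency between connection count and thread count. With NIO/async, a single small thread pool can manage thousands of simultaneous connections.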
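To illustrate what non-blocking I/O changes, the sketch below drives many concurrent connections from a single event loop using Python's asyncio. It is a toy echo exchange on localhost, not a model of the Kinesis protocol; the function names and peer count are assumptions for the example.

```python
import asyncio
import threading


async def echo(reader, writer):
    """Echo one line back to the client, then close the connection."""
    line = await reader.readline()
    writer.write(line)
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def ping(port: int, peer_id: int) -> bytes:
    """Open one connection, send a line, and return the echoed reply."""
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(f"peer-{peer_id}\n".encode())
    await writer.drain()
    reply = await reader.readline()
    writer.close()
    await writer.wait_closed()
    return reply


async def run_mesh(peer_count: int):
    """Serve and drive peer_count concurrent connections from one event loop."""
    server = await asyncio.start_server(echo, "127.0.0.1", 0)  # port 0: OS picks a free port
    port = server.sockets[0].getsockname()[1]
    async with server:
        replies = await asyncio.gather(*(ping(port, i) for i in range(peer_count)))
    # Thread count is reported so it can be compared with the connection count.
    return replies, threading.active_count()


if __name__ == "__main__":
    replies, threads = asyncio.run(run_mesh(50))
    print(len(replies), "connections handled with", threads, "thread(s)")
```

The connections are multiplexed by the event loop, so the number of threads does not grow with the number of peers. The number of sockets still does, which is why this addresses thread exhaustion but not connection growth.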

To prevent recurrence, AWS committed to: (1) active monitoring of OS resource metrics, including thread usage, across all critical services; (2) implementing capacity tests that validate OS resource limits before operational scaling changes; (3) reviewing internal inter-service dependencies to identify and mitigate couplings that could propagate failures; (4) improving the observability plane to ensure critical service health metrics are delivered through a path independent of Kinesis when Kinesis is degraded.

From a customer perspective, the operational lesson is clear: do not assume the AWS Service Health Dashboard reflects the state of all services accurately and in real time. During this incident, AWS was initially unable to post updates to the dashboard because the tool used to publish them depended on Cognito, one of the impaired services. Critical systems must have independent synthetic health checks, run from outside AWS or from multiple regions, that validate actual application behavior and not just the status reported by the provider.
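A synthetic health check of the kind described above can be very small. The sketch below uses only the Python standard library and checks both the HTTP status and an expected marker in the response body. The URL in the usage example is a placeholder, and the `fetch` parameter is an assumption added so the probe can be exercised without network access.

```python
import time
import urllib.request
from typing import Callable, Tuple


def http_fetch(url: str, timeout_s: float) -> Tuple[int, bytes]:
    """Default fetcher: plain HTTP GET using only the standard library."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return response.status, response.read()


def probe(url: str, expect: bytes, timeout_s: float = 3.0,
          fetch: Callable[[str, float], Tuple[int, bytes]] = http_fetch) -> dict:
    """Run one synthetic check and report health, latency, and any error."""
    started = time.monotonic()
    try:
        status, body = fetch(url, timeout_s)
        healthy = status == 200 and expect in body
        error = None
    except Exception as exc:  # timeouts, DNS failures, HTTP errors
        healthy, error = False, repr(exc)
    return {
        "url": url,
        "healthy": healthy,
        "latency_s": time.monotonic() - started,
        "error": error,
    }


if __name__ == "__main__":
    # Placeholder endpoint: replace with a route that exercises a real login or read path.
    print(probe("https://app.example.com/health/login", expect=b"ok"))
```

The value of such a probe comes from where it runs and where it reports: it should execute outside the region it checks and alert through a channel that does not share the provider's failure plane.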

My Senior Take: What This Incident Really Teaches
Senior Solutions Architect

I've worked with mission-critical distributed systems for over 16 years, including financial infrastructure where eight-hour outages are simply unacceptable. This incident interests me less for the bug itself (a misconfigured ulimit is a classic operational mistake) and far more for what it reveals about the nature of dependencies in large-scale systems.

The point that strikes me most is observer coupling: CloudWatch depended on Kinesis to function, and Kinesis was broken. This is not a bug; it is an architectural decision that became a single point of failure for the observability plane. In any system I design, the observability plane (metrics, logs, alerts) must be architecturally isolated from the data plane. If you use the same bus to transport business data and system telemetry, a bus failure blinds you exactly when you need visibility most.

Second point: the all-to-all topology between front-end servers is a choice that scales O(n²) in connection count. For n=10 servers, that's 90 connections. For n=100, it's 9,900. This kind of quadratic growth in resources is a time bomb in any system that needs to scale horizontally. A migration to NIO/async solves the thread problem, but does not solve the O(n²) connection problem; for that, you need an intermediate routing layer (a service mesh, a shard proxy, or a gossip protocol) that reduces the required connectivity.

Third point, and perhaps the most important for anyone building systems on AWS: AWS's internal dependencies are a black box. You don't know which services depend on which internally. The correct archi

Lessons Learned

OS resource limits are invisible until they explode. ulimit -u, file descriptors, max_map_count — these parameters rarely appear in standard capacity dashboards. Monitor them actively and set alerts with safety margins (e.g., alert at 70% of the maximum limit).
All-to-all topologies scale quadratically. Any mesh where each node connects to all others has O(n²) growth in connections. Before scaling horizontally, validate that the network topology supports the target node count without exhausting resources.
The observability plane cannot depend on the data plane it monitors. If your alerting system uses the same infrastructure it is monitoring, you lose visibility at the worst moment. Architecturally separate the telemetry path from the data path.
Internal cloud provider dependencies are black boxes. You don't know which AWS services depend on which internally. Design for the failure of any individual service, including Cognito, CloudWatch, and Kinesis.
Thread-per-connection is an anti-pattern for high-connection servers. By 2020, NIO/async was already the standard approach for high-concurrency servers. Systems still using thread-per-connection have a deterministic and low scaling ceiling.
Freeze high-risk operational changes before critical business periods. Expanding capacity on Thanksgiving Eve, two days before Black Friday, is an unnecessary risk. Infrastructure changes should be completed and stabilized with enough lead time to allow safe rollback.
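The first lesson's suggestion of alerting at 70% of a limit can be sketched as follows. The code reads the kernel's own thread count for a process from /proc on Linux and maps usage to an alert level. The function names and the 90% critical threshold are assumptions; only the 70% warning level comes from the lesson above.

```python
import os


def os_thread_count(pid: int):
    """Return the kernel's thread count for a process, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/status") as status:
            for line in status:
                if line.startswith("Threads:"):
                    return int(line.split()[1])
    except (OSError, ValueError, IndexError):
        pass
    return None


def alert_level(in_use: int, limit: int, warn: float = 0.7, critical: float = 0.9) -> str:
    """Map usage against a limit to 'ok', 'warn', or 'critical'."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    ratio = in_use / limit
    if ratio >= critical:
        return "critical"
    if ratio >= warn:
        return "warn"
    return "ok"


if __name__ == "__main__":
    count = os_thread_count(os.getpid())
    if count is not None:
        # 4096 is a hypothetical limit; use the host's effective ceiling in practice.
        print(count, alert_level(count, limit=4096))
```

The same pattern applies to file descriptors and memory map counts: read the current value, compare it with the configured ceiling, and alert well before the two meet.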

AWS Well-Architected Framework Analysis

Security

Indirect impact: Cognito failure compromised end-user authentication flows. Authentication dependencies without fallback became single points of failure for access to critical systems.

Reliability

Critical failure across multiple dimensions: absence of circuit breakers for OS resource limits, network topology with quadratic growth not tested at scale, circular dependencies between service and its observability system, and absence of fault isolation between data plane and control plane.

Sustainability

Not applicable as a causal factor in this incident.

Verdict: A Simple Incident with Complex Lessons

The November 2020 Kinesis incident is, on the surface, a story about a misconfigured ulimit. At depth, it is a story about how large-scale systems accumulate hidden dependencies that only become visible when they fail — and about how the failure of one component can paralyze the very diagnostic system that should have detected it. AWS deserves credit for the detailed post-mortem and transparency about the root cause. But the incident also exposes a fundamental tension in the operating model of a cloud provider: the more you use your own services as internal infrastructure (dogfooding), the more you create couplings that can amplify local failures into regional collapses. For engineers building on AWS — or any cloud provider — the central lesson is this: you cannot assume that your provider's control plane is independent of the data plane you use. Design your systems assuming any AWS service can fail, including observability services. Implement external synthetic health checks. Have authentication fallbacks. And never execute high-risk operational changes without an adequate stabilization window before critical business periods. The thread limit was a configuration detail.

Written with AI assistance from the public case and my architect's reading.