AWS Kinesis us-east-1 (2020): When an OS Thread Limit Took Down Half of AWS
In November 2020, a capacity expansion on Amazon Kinesis front-end servers exhausted the operating system's per-process thread limit, triggering cascading failures that impacted Cognito, CloudWatch, Lambda, ECS, and dozens of other services in us-east-1 for over eight hours. The incident exposed hidden inter-service dependencies within AWS and the risks of seemingly safe operational changes in large-scale systems.
Incident Facts
- Company / System: Amazon Web Services, Amazon Kinesis Data Streams
- Incident Date: November 25, 2020
- Total Duration: More than 8 hours (onset ~07:45 UTC; gradual recovery through ~18:00 UTC)
- Affected Region: us-east-1 (Northern Virginia), the largest AWS region
- Origin Service: Amazon Kinesis Data Streams (front-end servers)
- Impacted Services: Cognito, CloudWatch, Lambda, ECS, CloudFormation, EventBridge, Auto Scaling, AWS Console, and 20+ other services
- Root Cause: OS thread limit exhaustion on Kinesis front-end servers after capacity addition
- Technical Stack: Java/JVM front-end servers, Kinesis shards, internal dependencies via AWS SDK (which uses Kinesis internally)
- Customer Impact: Authentication failures (Cognito), missing metrics and alarms (CloudWatch), deployment and scaling failures, all at the start of Black Friday week
A single operating system configuration limit — the maximum number of threads per process — turned a routine capacity expansion into one of the most wide-ranging incidents in AWS's public history. What appeared to be a safe operational adjustment revealed a web of internal dependencies that no one had fully mapped, and brought down services that, on the surface, had nothing to do with data streaming.
What Happened
On the morning of November 25, 2020 — Thanksgiving Eve in the US and the start of Black Friday week — the Amazon Kinesis team executed a planned capacity expansion in the us-east-1 region. The goal was to increase the number of Kinesis front-end servers to support the elevated load expected during the peak commercial period. This is a routine operation in any large-scale distributed system: add nodes to distribute load.
The problem lay in an implementation detail of the Kinesis front-end servers. Each front-end server maintains connections to all other nodes in the cluster — an all-to-all topology model common in systems that need to route requests to specific shards without an additional indirection layer. As the number of front-end servers grew, each server needed to open and maintain more simultaneous connections. Each connection, in turn, used a dedicated thread for I/O management.
Linux limits how many threads can be created: ulimit -u (RLIMIT_NPROC) caps the threads and processes owned by a user, and kernel.threads-max caps the whole system. When the number of front-end servers crossed a certain threshold, the Java processes on existing servers tried to create new threads for connections to the new nodes, hit that ceiling, and failed with errors like java.lang.OutOfMemoryError: unable to create new native thread. From that point, Kinesis front-end servers began rejecting requests and failing health checks.
Up to this point, the incident would have been contained to Kinesis. What turned this into a regional disaster was something most engineers outside AWS didn't know: a significant portion of internal AWS services use Kinesis itself as infrastructure for telemetry, logging, and event delivery. The internal AWS SDK, used by services like Cognito, CloudWatch, Lambda, and ECS to publish metrics and logs, depended on Kinesis. When Kinesis became degraded, those services began queuing up telemetry sends, and the very health check and automatic recovery mechanisms of those services — which relied on CloudWatch for alarms and Cognito for internal call authentication — also began to fail.
Incident Timeline
1. ~07:30 UTC: Capacity Expansion Begins
The Kinesis team begins adding front-end servers in us-east-1 as part of a planned scaling procedure for Black Friday week.
2. ~07:45 UTC: First Kinesis Failures
Existing front-end servers begin failing to create new threads to manage connections to newly added nodes. OutOfMemoryError: unable to create new native thread errors appear in logs. Health checks begin failing.
3. ~08:00 UTC: Degradation Spreads to Cognito and CloudWatch
Services that internally depend on Kinesis for telemetry begin showing errors. Cognito starts failing authentications. CloudWatch stops receiving metrics and alarms cease firing, including the alarms that should have alerted teams about the incident itself.
4. ~08:30 UTC: Lambda, ECS, and Other Services Affected
The cascade widens. Lambda shows invocation failures. ECS fails on task management operations. CloudFormation and EventBridge are impacted. The AWS Console becomes partially unavailable, hampering manual diagnosis.
5. ~09:00 UTC: Root Cause Diagnosed
AWS teams identify thread exhaustion as the root cause. The decision is made to roll back the capacity expansion and reduce the number of front-end servers below the critical threshold.
6. ~10:00–14:00 UTC: Gradual Recovery
The Kinesis rollback begins taking effect. Dependent services recover as Kinesis resumes accepting connections. Recovery is non-uniform: some services recover quickly, others take hours due to accumulated queues and inconsistent states.
7. ~18:00 UTC: Substantial Recovery
Most affected services return to normal operation. AWS publishes updates on the Service Health Dashboard throughout the day, though with a lag relative to the actual severity of the incident.
Failure Flow: From Thread Limit to Cascading Collapse
The diagram reconstructs how thread exhaustion on Kinesis front-end servers propagated to seemingly unrelated services through internal telemetry and authentication dependencies.
- Capacity Expansion · New front-ends added
- Front-End Servers · (existing) · All-to-all mesh
- Front-End Servers · (new) · Added during scaling
- OS Thread Limit · ulimit -u exhausted · OOME: native thread
- Kinesis Shards · Backend storage
- CloudWatch · Metrics and alarms · stop working
- Amazon Cognito · Authentication fails · for end users
- AWS Lambda · Invocation failures · and cold start
- Amazon ECS · Task management · failures
- CloudFormation · Deploys fail
- EventBridge · Event delivery · degraded
- AWS Console · Partially · unavailable
- AWS Customers · Authentication, deploys, · scaling, observability
Root Cause: The Limit Nobody Was Monitoring
The root cause is technically simple and operationally devastating: Kinesis front-end servers used one thread per outbound connection to manage the all-to-all mesh between cluster nodes. When new servers were added, each existing server attempted to open new connections, and therefore create new threads, until it hit the operating system limit. On Linux, the relevant ceilings include ulimit -u (RLIMIT_NPROC, which counts threads per user rather than per process) and the system-wide kernel.threads-max, and default values can be surprisingly low for a process that holds thousands of connections. The JVM, upon failing to create a new native thread, throws OutOfMemoryError: unable to create new native thread. Despite the name, here this is not a heap error; it is an OS resource error. The server was not out of RAM; the operating system refused to create another thread. The limit was not being actively monitored, and there was no circuit breaker preventing the addition of new nodes when a target server was near the critical threshold.
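The missing monitoring can be sketched in a few lines. This is a minimal, Linux-only illustration and not AWS's tooling: it assumes the /proc filesystem is available, the function names are hypothetical, and the 70% alert threshold is an arbitrary safety margin. Note that RLIMIT_NPROC is accounted per user, so comparing a single process against it is only an approximation, and container limits such as a cgroup pids.max are not covered here.

```python
import os
import resource
from pathlib import Path


def parse_thread_count(status_text: str) -> int:
    """Extract the 'Threads:' field from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    raise ValueError("no 'Threads:' field found")


def effective_thread_limit(nproc_soft: int, threads_max: int) -> int:
    """Return the tighter of the per-user limit and the system-wide limit."""
    if nproc_soft == resource.RLIM_INFINITY:
        # No per-user limit is set, so only the system-wide cap applies.
        return threads_max
    return min(nproc_soft, threads_max)


def usage_ratio(threads: int, limit: int) -> float:
    """Fraction of the limit currently consumed."""
    return threads / limit


def check_current_process(alert_ratio: float = 0.70) -> bool:
    """Return True when this process is at or above the alert threshold."""
    status = Path(f"/proc/{os.getpid()}/status").read_text()
    threads = parse_thread_count(status)
    nproc_soft, _hard = resource.getrlimit(resource.RLIMIT_NPROC)
    threads_max = int(Path("/proc/sys/kernel/threads-max").read_text())
    limit = effective_thread_limit(nproc_soft, threads_max)
    return usage_ratio(threads, limit) >= alert_ratio
```

In practice the ratio would be exported as a metric and alarmed on, rather than checked ad hoc, so that a scaling change that pushes it upward is visible before thread creation starts failing.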
The Cascade Effect: Dependencies Nobody Documented
The most instructive aspect of this incident is not the bug itself — it is what it revealed about AWS's internal architecture. Kinesis is not just a streaming service for external customers; it is a critical internal infrastructure component used by other AWS services to transport telemetry, logs, and control events. This is an architectural decision that makes sense: using your own product as internal infrastructure (dogfooding) validates the service under real production load and reduces the need to maintain parallel observability stacks.
The problem is that this dependency created a structural coupling that was not publicly documented. When Kinesis failed, CloudWatch — which depended on Kinesis to ingest metrics — stopped receiving data. Without incoming metrics, alarms that should have fired automatically went silent. This created a situation of alarm blindness: the very tools that should have detected and alerted on the incident were compromised by the incident. On-call teams lost visibility precisely when they needed it most.
Cognito, in turn, used Kinesis for internal telemetry. When Kinesis became degraded, Cognito authentication calls began failing — not because Cognito itself was broken, but because the telemetry path was blocked and this affected the control flow. This directly impacted end users who depended on Cognito for authentication in their applications, and also impacted the AWS Console, which uses Cognito for session authentication. The result was that engineers trying to diagnose the problem through the Console also encountered difficulties logging in.
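One common way to keep a blocked telemetry path from stalling request handling is to emit telemetry through a bounded buffer that drops records instead of waiting. The sketch below illustrates that idea only; it is not Cognito's code, and the class and handler names are hypothetical.

```python
import queue


class DroppingTelemetryBuffer:
    """Bounded buffer that drops telemetry instead of blocking the caller."""

    def __init__(self, capacity: int) -> None:
        self._queue: queue.Queue = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def emit(self, record: dict) -> bool:
        """Enqueue a record without blocking; return False if it was dropped."""
        try:
            self._queue.put_nowait(record)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, max_items: int) -> list:
        """Remove up to max_items records for a background sender to ship."""
        batch = []
        while len(batch) < max_items:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch


def handle_login(user: str, telemetry: DroppingTelemetryBuffer) -> str:
    """Hypothetical request handler: the response never waits on telemetry."""
    telemetry.emit({"event": "login", "user": user})
    return f"ok:{user}"
```

The trade-off is explicit: when the downstream stream is degraded, some telemetry is lost, but authentication keeps answering. The dropped counter makes that loss observable.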
This pattern — where the observability system depends on the system that is failing — is a classic resilience anti-pattern. In financial systems where I've worked, we call this observer coupling: when the observer and the observed share the same failure plane, you lose diagnostic capability exactly when you need it most. The architectural solution is to ensure that the control and observability plane is completely independent of the data plane being monitored.
Remediation: What AWS Did and What They Promised to Change
The immediate remediation was straightforward: roll back the capacity expansion, reducing the number of front-end servers below the threshold that caused thread exhaustion. As front-end servers returned to normal operation, dependent services began recovering — first Kinesis itself, then CloudWatch, then Cognito, and so on, in reverse order of the failure cascade.
AWS published a detailed post-mortem (Summary of the Kinesis Event) that identified several corrective actions. In terms of immediate mitigation, the team increased OS thread limits on front-end servers and modified the code to use asynchronous connections (non-blocking I/O) instead of dedicated threads per connection — an architectural change that eliminates the linear dependency between connection count and thread count. With NIO/async, a single thread pool can manage thousands of simultaneous connections.
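To make the non-blocking model concrete, here is a small self-contained sketch using Python's asyncio rather than Java NIO. It opens many connections to a local echo server and reports how many threads the process is running while all of them are live. The function names and the connection count are illustrative assumptions; nothing here reflects the actual Kinesis implementation.

```python
import asyncio
import threading


async def echo_handler(reader: asyncio.StreamReader,
                       writer: asyncio.StreamWriter) -> None:
    """Echo one line back to the peer, then close the connection."""
    data = await reader.readline()
    writer.write(data)
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def run_demo(connections: int = 100) -> tuple[int, int]:
    """Open many concurrent connections; return (echoes_ok, thread_count)."""
    server = await asyncio.start_server(
        echo_handler, "127.0.0.1", 0, backlog=connections
    )
    port = server.sockets[0].getsockname()[1]
    async with server:
        # Open every connection before using any, so all are live at once.
        streams = await asyncio.gather(
            *(asyncio.open_connection("127.0.0.1", port)
              for _ in range(connections))
        )
        threads_while_connected = threading.active_count()
        ok = 0
        for i, (reader, writer) in enumerate(streams):
            message = f"ping {i}\n".encode()
            writer.write(message)
            await writer.drain()
            if await reader.readline() == message:
                ok += 1
            writer.close()
            await writer.wait_closed()
    return ok, threads_while_connected
```

Calling asyncio.run(run_demo()) returns the number of successful echoes and the observed thread count. The point of the sketch is that the thread count does not grow with the number of open connections, which is the property a thread-per-connection design lacks.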
To prevent recurrence, AWS committed to: (1) active monitoring of OS resource metrics, including thread usage, across all critical services; (2) implementing capacity tests that validate OS resource limits before operational scaling changes; (3) reviewing internal inter-service dependencies to identify and mitigate couplings that could propagate failures; (4) improving the observability plane to ensure critical service health metrics are delivered through a path independent of Kinesis when Kinesis is degraded.
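Commitment (2) amounts to a pre-flight gate on scaling changes. A minimal sketch of such a gate follows, assuming one extra thread per added peer (as in the thread-per-connection design described earlier) and a 70% safety margin. The data class, function names, and numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostThreads:
    """Thread usage snapshot for one front-end host (hypothetical schema)."""
    host: str
    threads_in_use: int
    thread_limit: int


def projected_threads(host: HostThreads, servers_added: int,
                      threads_per_peer: int = 1) -> int:
    """Threads the host would need after the fleet grows by servers_added."""
    return host.threads_in_use + servers_added * threads_per_peer


def hosts_blocking_scale_up(fleet: list[HostThreads], servers_added: int,
                            max_ratio: float = 0.70,
                            threads_per_peer: int = 1) -> list[str]:
    """Return hosts whose projected usage would exceed the safety margin."""
    blocked = []
    for host in fleet:
        needed = projected_threads(host, servers_added, threads_per_peer)
        if needed > max_ratio * host.thread_limit:
            blocked.append(host.host)
    return blocked
```

A deployment pipeline could refuse to proceed whenever the returned list is non-empty, turning an invisible OS ceiling into an explicit, reviewable failure before any server is added.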
From a customer perspective, the operational lesson is clear: do not assume the AWS Service Health Dashboard reflects the state of all services accurately and in real time. During this incident, the dashboard itself was compromised by the same problem affecting CloudWatch. Critical systems must have independent synthetic health checks, outside AWS or across multiple regions, that validate actual application behavior rather than just the status reported by the provider.
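An independent synthetic check can be very small. The sketch below is an illustration under assumptions, not a recommended product: it probes an application endpoint from several vantage points and classifies the result by quorum. The probe names, the quorum rule, and any URL passed to http_probe (for example a hypothetical https://app.example.com/healthz) are placeholders.

```python
import http.client
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with HTTP 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Covers connection errors, timeouts, and non-2xx HTTP errors.
        return False


def evaluate(results: dict[str, bool], quorum: int) -> str:
    """Classify health from the outcomes of independent probes."""
    healthy = sum(results.values())
    if healthy == len(results):
        return "healthy"
    if healthy >= quorum:
        return "degraded"
    return "down"


def run_checks(probes: dict[str, Callable[[], bool]], quorum: int) -> str:
    """Run each probe once and classify the combined outcome."""
    return evaluate({name: probe() for name, probe in probes.items()}, quorum)
```

The probes are passed in as callables so that each one can run from a different network or provider. What matters is that neither the checker nor its alerting path shares a failure domain with the system being checked.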
I've worked with mission-critical distributed systems for over 16 years, including financial infrastructure where eight-hour outages are simply unacceptable. This incident interests me less for the bug itself — a misconfigured ulimit is a classic operational mistake — and far more for what it reveals about the nature of dependencies in large-scale systems.
The point that strikes me most is observer coupling: CloudWatch depended on Kinesis to function, and Kinesis was broken. This is not a bug; it is an architectural decision that became a single point of failure for the observability plane. In any system I design, the observability plane (metrics, logs, alerts) must be architecturally isolated from the data plane. If you use the same bus to transport business data and system telemetry, a bus failure blinds you exactly when you need visibility most.
Second point: the all-to-all topology between front-end servers is a choice that scales O(n²) in connection count. For n=10 servers, that's 90 connections. For n=100, it's 9,900. This kind of quadratic growth in resources is a time bomb in any system that needs to scale horizontally. The migration to NIO/async solves the thread problem, but does not solve the O(n²) connection problem — for that, you need an intermediate routing layer (a service mesh, a shard proxy, or a gossip protocol) that reduces the required connectivity.
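The arithmetic behind that quadratic growth is easy to check. The sketch below counts directed connections in a full mesh and compares them with a hypothetical topology in which every server connects only to a fixed number of routers. The function names and the router count are illustrative assumptions.

```python
def mesh_connections(n: int) -> int:
    """Directed connections in an all-to-all mesh of n servers: n * (n - 1)."""
    return n * (n - 1)


def connections_per_server(n: int) -> int:
    """Outbound connections each server holds in the mesh (one per peer)."""
    return n - 1


def routed_connections(n: int, routers: int) -> int:
    """Connections when each server talks only to a fixed set of routers."""
    return n * routers


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, mesh_connections(n), routed_connections(n, routers=3))
```

With a thread per connection, connections_per_server is also the per-server thread demand from the mesh alone, which is why adding servers raises thread usage on every existing host at once.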
Third point, and perhaps the most important for anyone building systems on AWS: AWS's internal dependencies are a black box. You don't know which services depend on which internally. The correct archi
Lessons Learned
ulimit -u, file descriptors, max_map_count: these parameters rarely appear in standard capacity dashboards. Monitor them actively and set alerts with safety margins (e.g., alert at 70% of the maximum limit).
AWS Well-Architected Framework Analysis
Security
Indirect impact: Cognito failure compromised end-user authentication flows. Authentication dependencies without fallback became single points of failure for access to critical systems.
Reliability
Critical failure across multiple dimensions: absence of circuit breakers for OS resource limits, network topology with quadratic growth not tested at scale, circular dependencies between service and its observability system, and absence of fault isolation between data plane and control plane.
Sustainability
Not applicable as a causal factor in this incident.
Verdict: A Simple Incident with Complex Lessons
The November 2020 Kinesis incident is, on the surface, a story about a misconfigured ulimit. At depth, it is a story about how large-scale systems accumulate hidden dependencies that only become visible when they fail — and about how the failure of one component can paralyze the very diagnostic system that should have detected it.
AWS deserves credit for the detailed post-mortem and transparency about the root cause. But the incident also exposes a fundamental tension in the operating model of a cloud provider: the more you use your own services as internal infrastructure (dogfooding), the more you create couplings that can amplify local failures into regional collapses.
For engineers building on AWS — or any cloud provider — the central lesson is this: you cannot assume that your provider's control plane is independent of the data plane you use. Design your systems assuming any AWS service can fail, including observability services. Implement external synthetic health checks. Have authentication fallbacks. And never execute high-risk operational changes without an adequate stabilization window before critical business periods.
The thread limit was a configuration detail.