# Netflix: Chaos Engineering and Resilience by Design

An architectural reconstruction of how Netflix turned inevitable failures into deliberate practice — from the Simian Army to automated regional failover. This teardown examines the technical decisions, real trade-offs, and what separates genuine resilience from cosmetic high availability.

- URL: https://fernando.moretes.com/studies/netflix-chaos-engineering

- Markdown: https://fernando.moretes.com/studies/netflix-chaos-engineering/study.md?lang=en

- Type: Teardown

- Company: Netflix

- Domain: Resilience

- Date: 2021-06-01

- Tags: chaos-engineering, resilience, netflix, aws, microservices, failover, simian-army, distributed-systems

- Reading time: 7 min

---

By 2011, Netflix was running its streaming service largely on AWS (the last datacenter workloads were retired in early 2016) and, in doing so, had accepted an uncomfortable premise: infrastructure will fail. Their response was not to try to prevent failures; it was to build a system that absorbs them without the user noticing. The result is one of the most studied resilience architectures in the industry, and also one of the most misunderstood.

## Case Facts

- **Company:** Netflix, Inc.
- **Domain:** Video streaming, platform resilience
- **Scale (estimated):** ~260 million subscribers, peak >700 Gbps egress traffic
- **AWS Migration:** 2008–2016 (started after the August 2008 database incident; full exit from own datacenters completed in early 2016)
- **Chaos Monkey — launch:** 2011 (open-sourced in 2012)
- **Simian Army:** 2011–2018 (suite of fault injection tools)
- **Primary stack:** AWS (EC2, S3, DynamoDB, Route 53), Java/Kotlin (microservices), Zuul (API Gateway), Eureka (service discovery), Hystrix (circuit breaker), Cassandra, Kafka
- **Active AWS regions:** 3 simultaneously active regions (us-east-1, us-west-2, eu-west-1)
- **Availability SLO:** 99.99% declared availability for the streaming service

## The Problem: High Availability Is Not Resilience

Until 2008, Netflix ran on its own datacenters. In August 2008, a database corruption took down the DVD service for three days, an incident that exposed the fragility of a monolithic architecture where a single point of failure could paralyze everything. The decision to migrate to AWS was not merely operational; it was an architectural bet that the cloud, with its elasticity and geographic redundancy, would allow building something qualitatively different.

The real problem, however, was not the infrastructure. It was the culture and engineering surrounding it. Distributed systems in the cloud introduce a class of failures that traditional datacenters rarely experience in production: instances that silently disappear, latencies that explode asymmetrically, network partitions that isolate entire zones. The conventional response is to try to prevent these failures with passive redundancy — multiple AZs, synchronous replication, automatic failover configured but never tested.

Netflix rejected this approach. The central premise guiding their entire resilience architecture is: **if you don't test failures in production, you don't know whether your system survives them**. Untested redundancy is the illusion of redundancy. A failover that has never been exercised under real conditions will fail at the worst possible moment — when load is at peak, when the team is asleep, when the incident is already underway. This premise is not philosophical; it is operational. And it was this premise that generated the Simian Army.

## Reconstructed Resilience Architecture

Flow of a streaming request traversing multiple resilience layers, with Simian Army fault injection points and graceful degradation mechanisms.

### 👤 Client

- Client (TV/Mobile/Web) (user)

### 🌐 Edge / CDN

- Netflix CDN (Open Connect) (edge)
- Route 53 DNS Failover (network)

### 🔀 API Gateway Layer

- Zuul API Gateway (frontend)
- Eureka Service Discovery (network)

### ⚙️ Microservices (us-east-1 — Primary)

- Auth Service (stateless) (compute)
- Catalog Service (read-heavy) (compute)
- Recommendation Service (compute)
- Hystrix Circuit Breaker (security)

### 💾 Data Layer

- Cassandra (multi-region) (data)
- DynamoDB (session/state) (data)
- S3 (assets/fallback) (storage)
- Kafka (event stream) (messaging)

### 🐒 Simian Army (Fault Injection)

- Chaos Monkey: kills instances (ci)
- Chaos Kong: kills AZ/Region (ci)
- Latency Monkey: injects delay (ci)
- Conformity Monkey: best practices (ci)

### 🌍 Failover Region (us-west-2)

- Zuul (standby-active) (frontend)
- Microservices (active replica) (compute)
- Cassandra (replica) (data)

### Flows

- client -> cdn: video (bytes)
- client -> r53: DNS lookup
- r53 -> zuul: routes to primary region
- zuul -> eureka: resolve service
- zuul -> hystrix: all calls through CB
- hystrix -> auth: authentication
- hystrix -> catalog: catalog
- hystrix -> reco: recommendations
- catalog -> cassandra: read
- auth -> dynamo: session
- reco -> s3: static fallback
- catalog -> kafka: usage events
- chaos_monkey -> auth: terminates instance
- chaos_monkey -> catalog: terminates instance
- chaos_kong -> zuul: removes entire region
- latency_monkey -> reco: injects 2-5s latency
- r53 -> zuul_dr: automatic DNS failover
- zuul_dr -> services_dr: redirected traffic
- services_dr -> cassandra_dr: replicated data
- cassandra -> cassandra_dr: async replication

## How the System Works: Layers of Resilience

Netflix's resilience architecture is not a single product — it is a composition of independent mechanisms that reinforce each other. Understanding each layer separately is essential to understanding why the whole works.

**Layer 1 — Graceful Degradation with Hystrix.** Every service that calls external dependencies is wrapped by a Hystrix circuit breaker. When a dependency starts failing or exceeding configured timeouts, the circuit opens and the call immediately returns a fallback value — usually cached data, a static response, or a degraded version of the feature. The canonical example is the recommendation service: if it fails, the user receives a pre-computed list of popular films instead of personalized recommendations. The experience is worse, but the service keeps running. This pattern — *fail fast, fallback gracefully* — is applied systematically across hundreds of microservices.
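The fail-fast, fallback-gracefully pattern can be sketched in a few lines. This is a minimal illustration, not Hystrix's implementation: Hystrix trips on an error percentage over a rolling window and also enforces timeouts, while this sketch simply counts consecutive failures. The class name, thresholds, and the two recommendation functions are hypothetical.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """CLOSED -> OPEN after repeated failures, then HALF_OPEN
    (one trial call allowed) once a cool-down has elapsed."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means CLOSED

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "CLOSED"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "HALF_OPEN"
        return "OPEN"

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.state == "OPEN":
            return fallback()  # fail fast: do not touch the dependency
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def personalized_recommendations() -> List[str]:
    # Hypothetical dependency that is currently failing.
    raise TimeoutError("recommendation service too slow")


def popular_titles() -> List[str]:
    # Hypothetical pre-computed fallback list.
    return ["popular-1", "popular-2"]


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=30.0)
for _ in range(3):
    titles = breaker.call(personalized_recommendations, popular_titles)
    print(titles, breaker.state)
```

The third call never reaches the failing dependency: the circuit is already open, so the caller gets the popular list immediately instead of waiting for another timeout.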

**Layer 2 — Dynamic Service Discovery with Eureka.** Eureka is Netflix's internal service registry. Each instance registers on startup and sends periodic heartbeats (every 30 seconds by default). When an instance dies, whether from real failure or Chaos Monkey action, it stops renewing its lease and is evicted from the registry once the lease expires (90 seconds by default). Removal therefore takes tens of seconds, not the minutes or hours a long DNS TTL can impose. Zuul, the API Gateway, uses Eureka's registry to route requests only to healthy instances. This means the death of an individual instance is absorbed by the routing layer without human intervention.
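A toy version of the heartbeat-and-eviction mechanism helps make the convergence behavior concrete. This is a sketch of the general lease pattern, not Eureka's code or API; the class, method names, and lease length are assumptions chosen for illustration.

```python
import time
from typing import Callable, Dict, List, Tuple


class ServiceRegistry:
    """Toy heartbeat registry: instances that stop renewing their lease are evicted."""

    def __init__(self, lease_seconds: float = 90.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.lease_seconds = lease_seconds
        self.clock = clock
        # (service name, instance id) -> timestamp of the last heartbeat
        self._last_seen: Dict[Tuple[str, str], float] = {}

    def heartbeat(self, service: str, instance_id: str) -> None:
        """Register the instance or renew its lease."""
        self._last_seen[(service, instance_id)] = self.clock()

    def evict_expired(self) -> List[str]:
        """Drop instances whose lease ran out and return their ids."""
        now = self.clock()
        expired = [key for key, seen in self._last_seen.items()
                   if now - seen > self.lease_seconds]
        for key in expired:
            del self._last_seen[key]
        return [instance_id for _, instance_id in expired]

    def healthy_instances(self, service: str) -> List[str]:
        """Instances a router may send traffic to right now."""
        self.evict_expired()
        return sorted(i for (s, i) in self._last_seen if s == service)
```

The router never needs to be told that an instance died. It only needs to stop seeing it in `healthy_instances`, which happens as soon as the lease lapses.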

**Layer 3 — Regional Failover with Chaos Kong.** The highest level of resilience is regional failover. Netflix operates in multiple AWS regions simultaneously, not in active-passive mode but active-active, with enough capacity to absorb an entire region's traffic. Chaos Kong simulates the complete loss of a region: it diverts all traffic from, say, us-east-1 to us-west-2 via Route 53, and the team observes whether the system holds. This exercise is performed periodically in production. Cassandra, the primary database for catalog and profile data, operates with multi-region replication configured for eventual consistency, a deliberate choice that prioritizes availability over strong consistency (in CAP theorem terms, an AP system).
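Whether a region can be evacuated comes down to arithmetic on spare capacity. The sketch below is a hypothetical planning helper, not Netflix tooling: it spreads the failed region's load across the survivors in proportion to their headroom and refuses when the survivors cannot absorb it. The region names come from the case, but the load and capacity numbers are invented units.

```python
from typing import Dict


def evacuate(load: Dict[str, float], capacity: Dict[str, float],
             failed: str) -> Dict[str, float]:
    """Return the per-region load after moving the failed region's traffic
    to the survivors, proportionally to each survivor's spare capacity."""
    survivors = [region for region in load if region != failed]
    spare = {r: capacity[r] - load[r] for r in survivors}
    total_spare = sum(spare.values())
    displaced = load[failed]
    if displaced > total_spare:
        raise RuntimeError(
            f"cannot evacuate {failed}: need {displaced}, spare {total_spare}")
    if displaced == 0:
        return {r: load[r] for r in survivors}
    return {r: load[r] + displaced * spare[r] / total_spare for r in survivors}


# Illustrative numbers only (arbitrary units of traffic).
load = {"us-east-1": 40.0, "us-west-2": 30.0, "eu-west-1": 30.0}
capacity = {"us-east-1": 60.0, "us-west-2": 60.0, "eu-west-1": 60.0}
print(evacuate(load, capacity, failed="us-east-1"))
# {'us-west-2': 50.0, 'eu-west-1': 50.0}
```

A Chaos Kong exercise is, in effect, an empirical check of this calculation: the spreadsheet says the survivors have headroom, and the drill shows whether they really do under live traffic.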

**Layer 4 — The Simian Army as Engineering Culture.** The Simian Army is not just a toolset; it is the institutionalization of a culture. Chaos Monkey runs during business hours (not at midnight) deliberately — if it is going to cause a problem, it is better for it to happen when the team is awake and can respond. Conformity Monkey checks whether instances follow configured best practices. Security Monkey audits security configurations. Doctor Monkey performs health checks. Each monkey attacks a different dimension of resilience, and together they create continuous pressure for teams to build services that survive failures rather than depend on failures not happening.
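The business-hours rule is simple to express in code. The following is a sketch of the scheduling idea only, not the Chaos Monkey implementation: the 9:00 to 15:00 weekday window, the per-group probability, and the choice to skip single-instance groups are assumptions of this example, and the actual termination call is left out.

```python
import random
from datetime import datetime
from typing import Dict, List


def in_business_hours(now: datetime, start_hour: int = 9, end_hour: int = 15) -> bool:
    """True on weekdays between start_hour (inclusive) and end_hour (exclusive)."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victims(groups: Dict[str, List[str]], now: datetime,
                 rng: random.Random, probability: float = 1.0) -> Dict[str, str]:
    """Choose at most one instance per group to terminate, business hours only."""
    if not in_business_hours(now):
        return {}
    victims: Dict[str, str] = {}
    for group, instances in groups.items():
        # Safety choice of this sketch: leave single-instance groups alone.
        if len(instances) > 1 and rng.random() < probability:
            victims[group] = rng.choice(instances)
    return victims


groups = {"auth": ["i-a1", "i-a2", "i-a3"], "catalog": ["i-c1", "i-c2"]}
monday_10am = datetime(2021, 6, 7, 10, 0)
print(pick_victims(groups, monday_10am, random.Random(42)))
```

Passing the clock and the random generator as arguments keeps the selection reproducible, which matters when an experiment needs to be explained after the fact.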

## The Principles of Chaos Engineering: What Netflix Formalized

In 2015, Netflix published the *Principles of Chaos Engineering*, an attempt to formalize what had been practiced empirically. There are five principles, and each has direct architectural implications worth examining carefully.

**1. Build a hypothesis around steady-state behavior.** Before injecting any failure, you need to define what normal looks like. For Netflix, this means business metrics — *stream starts per second* (SPS) is the primary metric, not CPU or latency. If SPS drops, something is wrong. This metric choice is architecturally important: it forces instrumentation to be oriented toward business outcomes, not internal technical state.
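A steady-state hypothesis becomes testable once it is written as a comparison. The sketch below assumes a control group and an experiment group of traffic, each reporting SPS samples, and a tolerated relative drop of 5%; the function name, the grouping, and the threshold are illustrative assumptions, not Netflix's published method.

```python
from statistics import mean
from typing import Sequence


def steady_state_holds(control_sps: Sequence[float],
                       experiment_sps: Sequence[float],
                       max_relative_drop: float = 0.05) -> bool:
    """Hypothesis: the experiment group's mean SPS stays within
    max_relative_drop of the control group's mean SPS."""
    baseline = mean(control_sps)
    observed = mean(experiment_sps)
    if baseline <= 0:
        raise ValueError("control group has no traffic; hypothesis is untestable")
    relative_drop = (baseline - observed) / baseline
    return relative_drop <= max_relative_drop


control = [1000.0, 1010.0, 990.0]
print(steady_state_holds(control, [995.0, 1000.0, 985.0]))  # small dip
print(steady_state_holds(control, [900.0, 880.0, 910.0]))   # roughly 10% drop
```

Comparing against a concurrent control group rather than a fixed number keeps normal daily traffic swings from being mistaken for the effect of the injected failure.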

**2. Vary real-world events.** The injected failures must reflect what actually happens: instances that die, disks that fail, network latencies that increase, external dependencies that slow down. It is not about creating artificial scenarios — it is about reproducing in controlled conditions what production already experiences in an uncontrolled way.

**3. Run experiments in production.** This is the most controversial and most important principle. Staging environments do not replicate real load, real traffic patterns, real data. A bug that only appears with 10 million simultaneous requests will not appear in staging. Netflix accepts the risk of production degradation as the cost of having genuine confidence in resilience.

**4. Automate experiments to run continuously.** Resilience is not a state you achieve once; it is a property that must be maintained. Each new deploy can introduce a resilience regression. Chaos Monkey running continuously ensures the system is tested under failure conditions after every change.

**5. Minimize the blast radius.** Experiments start small: one instance, one AZ, a small percentage of traffic. The blast radius is expanded gradually as confidence increases. This is what allows running in production without systematically causing major incidents.
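Gradual expansion with an abort condition can be sketched as a small control loop. The callbacks `inject`, `healthy`, and `rollback` are hypothetical hooks standing in for whatever fault-injection and monitoring tooling is actually in place; the stage fractions are arbitrary.

```python
from typing import Callable, List, Sequence


def run_escalating(stages: Sequence[float],
                   inject: Callable[[float], None],
                   healthy: Callable[[], bool],
                   rollback: Callable[[], None]) -> List[float]:
    """Inject failure at growing traffic fractions and stop at the first
    stage where the steady-state check fails. Returns the stages that passed."""
    completed: List[float] = []
    for fraction in stages:
        inject(fraction)
        if not healthy():
            break  # do not widen the blast radius any further
        completed.append(fraction)
    rollback()  # always restore normal traffic, pass or fail
    return completed


# Simulated system that degrades once more than 10% of traffic is affected.
state = {"fraction": 0.0}
passed = run_escalating(
    stages=[0.01, 0.05, 0.25],
    inject=lambda f: state.update(fraction=f),
    healthy=lambda: state["fraction"] <= 0.10,
    rollback=lambda: state.update(fraction=0.0),
)
print(passed)  # [0.01, 0.05]
```

The useful output is the largest stage that passed: it tells the team how far the fallbacks actually hold, without ever exposing all users to the failure.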

What Netflix did was transform these principles into executable infrastructure. The distance between the principle and the implementation is where most organizations fail — they adopt the vocabulary of chaos engineering without building the fallback mechanisms that make experiments safe to run.

## Central Architectural Trade-offs

### Eventual Consistency (AP) vs. Strong Consistency (CP)

**Pros**
- Maximum availability even during network partitions
- Predictable read latency without cross-replica coordination
- Regional failover without write blocking

**Cons**
- User may see slightly stale data (profile, history)
- Write conflicts in split-brain scenarios require resolution

**Verdict:** Correct for the domain: streaming tolerates seconds of staleness; unavailability is not tolerated.

### Multi-Region Active-Active vs. Active-Passive

**Pros**
- Failover without cold-start: target region is already warm
- Real traffic continuously validates secondary region capacity

**Cons**
- ~2x operational cost: full capacity must exist in multiple regions
- Data synchronization complexity across regions

**Verdict:** Justified by the 99.99% SLO: the cost of downtime exceeds the cost of active redundancy.
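The "~2x" figure is worth checking against the number of regions. Under the simplifying assumptions that traffic is spread evenly, compute scales linearly with traffic, and data replication costs are ignored, surviving the loss of one region out of N requires provisioning N/(N-1) times peak capacity. A short calculation, with a hypothetical helper name:

```python
def provisioning_factor(regions: int, tolerated_failures: int = 1) -> float:
    """Total capacity to provision, as a multiple of peak load, so that the
    surviving regions can carry all traffic after the tolerated failures."""
    survivors = regions - tolerated_failures
    if survivors < 1:
        raise ValueError("at least one region must survive")
    return regions / survivors


for n in (2, 3, 4):
    print(n, provisioning_factor(n))
```

With two regions the factor is exactly 2.0, which matches the "~2x" estimate. With three evenly loaded regions it falls to 1.5, so the compute overhead of active-active shrinks as regions are added, although synchronization complexity does not.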

### Chaos in Production vs. Chaos in Staging

**Pros**
- Tests the real system with real load and real data
- Detects resilience regressions introduced by new deploys

**Cons**
- Risk of experience degradation for real users
- Requires high operational maturity: fallbacks must exist before experiments

**Verdict:** Correct for Netflix; dangerous without graceful degradation mechanisms already in place.

### Automatic Circuit Breaker vs. Simple Timeout

**Pros**
- Prevents failure cascade: a slow service does not bring down upstream
- Automatic recovery without human intervention

**Cons**
- Threshold configuration is hard: too sensitive generates false positives
- Fallback must be explicitly implemented by each service

**Verdict:** Indispensable in microservices architecture with hundreds of dependencies.

## Well-Architected Read

- **security**: **Adequate, but not the primary focus of this architecture.** Security Monkey (part of the Simian Army) continuously audits security configurations. The microservices architecture with Zuul as a centralized entry point facilitates the application of authentication and authorization policies. The historical blind spot was the broad attack surface of an architecture with hundreds of microservices — each potentially with its own security configuration.
- **reliability**: **Excellent, with nuances.** Netflix is the reference case for the reliability pillar. Multiple AZs, multiple active regions, circuit breakers, automated health checks, dynamic service discovery, and continuous chaos engineering. The only caveat is that eventual consistency introduces inconsistency windows that would be unacceptable in financial domains — but for streaming, it is the correct choice.
- **sustainability**: **Not explicitly addressed in public literature.** The use of Spot Instances for encoding workloads reduces capacity waste. The proprietary CDN (Open Connect) installed at ISPs reduces the physical distance of bytes, which has an indirect energy benefit. But active redundancy across multiple regions implies significantly higher energy consumption than an active-passive architecture.

> **What I'd Do Differently — and What I'd Take to Any Project:** Netflix's architecture is genuinely impressive, but there are three things I would question or adjust, and one lesson I apply to every resilience project.

**What I'd question:** Hystrix entered maintenance mode in 2018; its README points new projects to Resilience4j, and Netflix shifted its own focus toward adaptive approaches such as concurrency limits. The lesson here is that application-layer circuit breakers have a fundamental problem: they are invisible to the infrastructure. A service mesh solves this by moving resilience logic to the sidecar, making it observable and configurable without code changes. If I were designing today, I would start with Envoy as a sidecar and not with Hystrix in the application.

**What I'd adjust in the Simian Army:** The Simian Army was discontinued as a unified project. Its ideas live on in more focused tooling, and managed services such as AWS Fault Injection Simulator (FIS) now cover much of the same ground inside the AWS ecosystem. The problem with the original Simian Army was that each monkey was an independent service with its own lifecycle, which made the suite hard to maintain and hard to coordinate. A modern approach would use FIS with hypotheses defined as code (IaC), integrated into the CI/CD pipeline, with automatic rollback if the steady-state metric degrades beyond a threshold.

**What I'd do differently on consistency:** For user profile and preference data, I would explore CRDTs (Conflict-free Replicated Data Types) instead of ad-hoc conflict resolution. CRDTs guarantee eventual convergence without conflicts — a more elegant choice than 'last write wins' for user data.
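To make the contrast with last-write-wins concrete, here is a minimal sketch of one CRDT, an observed-remove set, which could model something like a user's saved-titles list. It is an illustration of the data type under simplified assumptions (state-based merge, tombstones never garbage-collected), not a description of how Netflix stores profile data.

```python
import uuid
from typing import Dict, Hashable, Set


class ORSet:
    """Observed-remove set: a state-based CRDT in which an add that is
    concurrent with a remove survives the merge."""

    def __init__(self) -> None:
        self._adds: Dict[Hashable, Set[str]] = {}  # element -> unique add tags
        self._removed: Set[str] = set()            # tombstoned tags

    def add(self, element: Hashable) -> None:
        self._adds.setdefault(element, set()).add(uuid.uuid4().hex)

    def remove(self, element: Hashable) -> None:
        # Tombstone only the tags this replica has observed so far.
        self._removed |= self._adds.get(element, set())

    def merge(self, other: "ORSet") -> None:
        # Set unions make merge commutative, associative, and idempotent.
        for element, tags in other._adds.items():
            self._adds.setdefault(element, set()).update(tags)
        self._removed |= other._removed

    def value(self) -> Set[Hashable]:
        return {e for e, tags in self._adds.items() if tags - self._removed}


# Two regional replicas diverge, then exchange state.
east, west = ORSet(), ORSet()
east.add("title-42")
west.merge(east)
west.remove("title-42")   # removed in one region...
east.add("title-42")      # ...while re-added concurrently in the other
east.merge(west)
west.merge(east)
print(east.value() == west.value(), east.value())
```

Both replicas converge to the same value regardless of merge order, and the concurrent re-add is preserved. Under last-write-wins, the outcome would depend on which region's clock happened to be ahead.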

## Verdict

Netflix built the most influential resilience architecture of the last decade — not because it is perfect, but because it is honest. It starts from a premise that most organizations avoid admitting: failures are inevitable, and the only way to know whether you survive them is to test them deliberately. The result is a system where each layer — from the individual circuit breaker to regional failover — exists because it was validated under real conditions, not because it looked good on a diagram.

What makes this architecture hard to replicate is not the technology — Hystrix, Eureka, Cassandra are open-source. What is hard to replicate is the culture that accepts running Chaos Monkey in production during business hours, that measures resilience in SPS rather than instance uptime, that treats every failure as an experiment rather than an incident of shame. That culture is the most valuable architectural product Netflix has built.

For teams that want to learn from this case: do not start with Chaos Monkey. Start with fallbacks. Implement graceful degradation in every critical service, define your steady-state metric, and only then begin injecting failures.

## References

- [Netflix Tech Blog](https://netflixtechblog.com/)
- [Principles of Chaos Engineering](https://principlesofchaos.org/)
- [Netflix Tech Blog: The Netflix Simian Army (2011)](https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116)
- [Netflix Tech Blog: Chaos Engineering Upgraded (2015)](https://netflixtechblog.com/chaos-engineering-upgraded-878d341f15fa)
- [Netflix Tech Blog: Active-Active for Multi-Regional Resiliency](https://netflixtechblog.com/active-active-for-multi-regional-resiliency-c47719f6685b)
- [Netflix Tech Blog: Hystrix — Latency and Fault Tolerance](https://netflixtechblog.com/introducing-hystrix-for-resilience-engineering-13531c1ab362)
- [AWS Fault Injection Simulator (FIS) Documentation](https://docs.aws.amazon.com/fis/latest/userguide/what-is.html)
- [Netflix Tech Blog: Eureka — Service Discovery at Netflix](https://netflixtechblog.com/eureka-the-netflix-service-discovery-framework-the-heart-of-mid-tier-load-balancing-6a23c6a3dbb6)

