Teardown: Resilient Network Graphs and the Next-Generation AI Network
An in-depth architectural analysis of the resilient graph-based data center networks AWS is building to support AI workloads at scale — covering topology, congestion control, energy efficiency, and the trade-offs that define the next generation of cloud infrastructure.
Training a frontier language model is not a compute problem — it is a network problem. When tens of thousands of accelerators need to exchange gradients in microseconds, the switch topology, routing algorithm, and congestion control policy determine whether the job finishes in days or in weeks. AWS is redesigning its data center network from the ground up for this regime. This teardown reconstructs the architecture, examines the design decisions, and evaluates what works, what is risky, and what I would do differently.
Fact Sheet
- Company / System: Amazon Web Services, AI Data Center Network
- Domain: Network infrastructure / AI workloads
- Declared scale: Tens of thousands of accelerators per cluster; 400 Gbps to 3.2 Tbps per port (roadmap estimate)
- Core stack: Fat-tree / multi-stage Clos topology, adaptive ECMP, RDMA over Converged Ethernet (RoCEv2), custom silicon (Nitro cards, Trainium and Inferentia accelerators), graph-based routing
- Sustainability: 100% renewable energy goal (achieved globally in 2023); average PUE ~1.2 in new data centers
- Main public reference: AWS Architecture Center, Amazon Sustainability Report 2023
- Analysis type: Architectural teardown, reconstructed from public sources
The Problem: Why Conventional Data Center Networking Breaks Under AI
For most of cloud computing history, data center networks were designed for microservice east-west traffic: many short connections, reasonable tolerance for variable latency, and relatively unpredictable traffic patterns that benefit from static ECMP. The model works well for OLTP, video streaming, and REST APIs. It fails categorically for distributed AI model training.
The reason is structural. In a collective All-Reduce training job (the dominant pattern in frameworks like PyTorch with NCCL), all workers must synchronize gradients at every optimization step. This generates an incast traffic pattern: hundreds or thousands of flows converge simultaneously on the same aggregation switches. Buffers overflow. Packets are dropped. The RDMA transport falls back to retransmission, and NCCL stalls while it waits. The job that should finish in 12 hours finishes in 18, or doesn't finish at all.
The congestion problem is compounded by three AI-specific factors: (1) tensor size — a single All-Reduce on a 70B parameter model can move hundreds of gigabytes per step; (2) synchronicity — unlike web traffic, workers fire simultaneously because SGD is synchronous by design; (3) jitter sensitivity — tail latency kills efficiency because the step only advances when the last worker finishes. A single congested link can degrade GPU utilization from 90% to 40%.
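The first and third factors can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes a ring All-Reduce (reduce-scatter followed by all-gather), pure data parallelism over unsharded bf16 gradients, and made-up per-worker timings. The function names are hypothetical.

```python
def ring_allreduce_bytes_per_worker(num_params: int, bytes_per_param: int,
                                    num_workers: int) -> float:
    """Bytes each worker sends in one ring All-Reduce.

    Reduce-scatter and all-gather each move (N - 1) / N of the payload,
    so the total is 2 * (N - 1) / N times the gradient size.
    """
    payload = num_params * bytes_per_param
    return 2 * (num_workers - 1) / num_workers * payload


def step_time(worker_times_s: list[float]) -> float:
    """A synchronous step completes only when the slowest worker finishes."""
    return max(worker_times_s)


# Assumed example: 70B parameters, 2 bytes each (bf16), 1024 workers.
sent = ring_allreduce_bytes_per_worker(70_000_000_000, 2, 1024)
print(f"{sent / 1e9:.1f} GB sent per worker per step")

# Assumed timings: 1023 workers finish in 100 ms, one is delayed to 225 ms.
times = [0.100] * 1023 + [0.225]
print(f"step time {step_time(times):.3f} s, "
      f"useful fraction {0.100 / step_time(times):.0%}")
```

One delayed worker out of 1024 sets the step time for everyone, which is why tail latency, not average latency, is the metric that matters here.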
Beyond congestion, there is the failure topology problem. In a classic Clos network with static ECMP, a spine switch failure can create load asymmetry that the data plane does not detect quickly. For long-running workloads (multi-day jobs), this means silent degradation — the job continues, but slowly, with no obvious alarm. Detecting and re-routing in seconds, not minutes, is a production requirement.
Finally, there is the energy cost dimension. A 16,000-GPU H100 cluster consumes on the order of 50-80 MW in compute alone. The interconnect network adds 10-20% on top of that. High-radix switches with deep buffers and high-speed SerDes are power-hungry. The choice of topology, port speed, and buffer policy has a direct impact on PUE and the energy bill — and therefore on Amazon's sustainability goals.
Reconstructed Architecture: Data Center Network for AI Workloads
Reconstruction based on AWS public sources. Represents an AI training cluster with multi-stage Clos topology, graph-based control plane, and integration with the sustainability stack.
- GPU/Trainium Worker Rack 0
- GPU/Trainium Worker Rack 1
- GPU/Trainium Worker Rack N
- ToR Switch, Rack 0 (400G)
- ToR Switch, Rack 1 (400G)
- ToR Switch, Rack N (400G)
- Aggregation Switch A0
- Aggregation Switch A1
- Spine Switch S0 (custom ASIC)
- Spine Switch S1 (custom ASIC)
- Spine Switch S2 (custom ASIC)
- Graph-Based Control Plane
- In-Band Telemetry (INT)
- Adaptive Congestion Controller (DCQCN)
- PUE Monitor & Power Capping
- Renewable Energy Matching (100%)
- S3 / EFS Checkpoint Store
How It Works: Topology, Graphs, and Congestion Control
The core architecture is a multi-stage Clos network — specifically a three-layer fat-tree variant: ToR (Top-of-Rack), Aggregation, and Spine. This is not novel; Clos has been the industry standard since the 2000s. What differentiates AWS's approach for AI is what happens above the data plane.
Graph-based control plane. The network is modeled as a weighted directed graph where nodes are switches and edges are links with attributes for capacity, current utilization, and health state. A centralized controller (analogous to an SDN controller, but with graph semantics) maintains this representation in memory and runs shortest-path algorithms with capacity constraints: essentially a Dijkstra or Bellman-Ford variant with multiple objectives (minimize tail latency, maximize bisection throughput, and avoid degraded links). When in-band telemetry (INT) reports a link above 70% utilization, the controller recalculates alternative routes and pushes them into the data plane through a programmable interface such as P4Runtime or OpenConfig (gNMI), without human intervention. The target convergence time is sub-second, which is critical for training jobs that cannot pause.
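A minimal sketch of the route recalculation described above, assuming a plain Dijkstra search over a directed graph whose edges carry utilization and health. The cost function, the 10x penalty, and all node names are hypothetical; only the 70% threshold comes from the text. This is not AWS's controller, just one way the idea could look.

```python
import heapq

UTIL_THRESHOLD = 0.70  # from the text: links above 70% utilization get avoided


def link_cost(utilization: float, healthy: bool) -> float:
    """Hypothetical cost: unusable if unhealthy, heavily penalized if hot."""
    if not healthy:
        return float("inf")
    penalty = 10.0 if utilization > UTIL_THRESHOLD else 1.0
    return penalty * (1.0 + utilization)


def shortest_path(graph, src, dst):
    """Dijkstra over graph = {node: {neighbor: (utilization, healthy)}}."""
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, (util, healthy) in graph.get(u, {}).items():
            cost = link_cost(util, healthy)
            if cost == float("inf"):
                continue
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


graph = {
    "tor0": {"agg0": (0.85, True), "agg1": (0.30, True)},
    "agg0": {"tor1": (0.20, True)},
    "agg1": {"tor1": (0.40, True)},
}
print(shortest_path(graph, "tor0", "tor1"))  # avoids the hot tor0->agg0 link

graph["agg1"]["tor1"] = (0.40, False)        # simulate a link failure
print(shortest_path(graph, "tor0", "tor1"))  # falls back to the hot path
```

A production controller would optimize several objectives at once and compute many paths per destination; a single scalar cost is the simplest stand-in.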
RoCEv2 and DCQCN. Data transport between GPUs uses RDMA over Converged Ethernet (RoCEv2), which eliminates kernel memory copy overhead and reduces latency to tens of microseconds. The classic RoCEv2 problem is that it assumes a lossless fabric: it relies on PFC (Priority Flow Control) to pause upstream transmitters when buffers fill. Misconfigured PFC can produce PFC deadlock, where pauses propagate through the network and stall unrelated traffic. AWS mitigates this with DCQCN (Data Center Quantized Congestion Notification), an ECN-based congestion control algorithm that reduces the sender's injection rate before the buffer overflows, keeping the network in a non-lossy regime without relying on PFC as the primary mechanism.
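To show how an ECN-based sender backs off before buffers overflow, here is a simplified sketch of the sender-side rate update from the published DCQCN algorithm. It keeps only the multiplicative decrease, the alpha decay, and the fast-recovery step; the additive and hyper-increase phases, timers, and byte counters are omitted. The class and method names are hypothetical, and g = 1/256 is the default suggested in the DCQCN paper, not a confirmed AWS setting.

```python
from dataclasses import dataclass


@dataclass
class DcqcnSender:
    line_rate_gbps: float
    g: float = 1 / 256            # gain for the alpha estimate (assumed default)
    alpha: float = 1.0            # running estimate of congestion severity
    current_rate_gbps: float = 0.0
    target_rate_gbps: float = 0.0

    def __post_init__(self):
        # A new flow starts at line rate.
        self.current_rate_gbps = self.line_rate_gbps
        self.target_rate_gbps = self.line_rate_gbps

    def on_cnp(self):
        """Congestion Notification Packet received: cut the rate."""
        self.target_rate_gbps = self.current_rate_gbps
        self.current_rate_gbps *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_interval(self):
        """No CNP during the alpha-update interval: decay alpha."""
        self.alpha = (1 - self.g) * self.alpha

    def on_recovery_tick(self):
        """Fast recovery: move halfway back toward the target rate."""
        self.current_rate_gbps = (self.current_rate_gbps + self.target_rate_gbps) / 2


sender = DcqcnSender(line_rate_gbps=400.0)
sender.on_cnp()
print(sender.current_rate_gbps)   # rate after the first cut
sender.on_recovery_tick()
print(sender.current_rate_gbps)   # rate after one recovery step
```

The key property is that the cut is proportional to alpha: a flow that keeps receiving CNPs is cut hard, while one that sees an isolated mark after a long quiet period is cut only slightly.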
In-band telemetry (INT). Each packet carries latency and queue utilization metadata collected by every switch in the path. The controller aggregates this data in real time to build a network heat map. This is fundamentally different from periodic SNMP polling — the temporal granularity is per-flow, not per-interface, and the observability latency is milliseconds, not minutes. For a training job with All-Reduce every 100ms, this makes a real operational difference.
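The heat map idea can be sketched as a simple aggregation over per-packet hop metadata. The record layout below (switch id, egress port, queue depth, hop latency) is an assumption for illustration and does not follow any specific INT header format; the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def build_heat_map(int_reports):
    """Aggregate per-packet INT hop records into per-link statistics.

    int_reports: iterable of packets, each a list of hops
    (switch_id, egress_port, queue_depth_bytes, hop_latency_ns).
    """
    samples = defaultdict(list)
    for hops in int_reports:
        for switch_id, port, queue_bytes, latency_ns in hops:
            samples[(switch_id, port)].append((queue_bytes, latency_ns))
    return {
        link: {
            "mean_queue_bytes": mean(q for q, _ in obs),
            "max_latency_ns": max(lat for _, lat in obs),
            "samples": len(obs),
        }
        for link, obs in samples.items()
    }


def hottest_links(heat_map, top_n=3):
    """Links sorted by mean queue depth, deepest first."""
    ranked = sorted(heat_map.items(),
                    key=lambda item: item[1]["mean_queue_bytes"],
                    reverse=True)
    return ranked[:top_n]


reports = [
    [("tor0", 1, 12_000, 900), ("spine1", 4, 250_000, 4_200)],
    [("tor0", 1, 8_000, 700), ("spine1", 4, 350_000, 5_100)],
]
heat = build_heat_map(reports)
for link, stats in hottest_links(heat, top_n=1):
    print(link, stats)
```

Because every sample is tied to a specific egress queue on a specific switch, the controller can see which hop is congested rather than inferring it from interface counters.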
Checkpointing and job resilience. The network alone does not guarantee job resilience. AWS integrates periodic checkpointing to S3/EFS directly into training frameworks (via SageMaker or Trainium SDK). If a node fails or a link degrades to the point of job abort, training restarts from the last checkpoint, not from scratch. Checkpoint frequency is a trade-off: more frequent checkpoints reduce lost work but increase network traffic and I/O cost — especially relevant for 100B+ parameter models where a single checkpoint can be hundreds of GB.
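The frequency trade-off has a well-known first-order answer, Young's approximation, which is not something the AWS sources mention but is a common starting point: the interval that minimizes expected lost time is roughly the square root of twice the checkpoint write time multiplied by the mean time between failures. The sketch below uses assumed inputs (a 30-second checkpoint write and a 6-hour MTBF for the whole job).

```python
import math


def young_interval_s(checkpoint_write_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2 * checkpoint_write_s * mtbf_s)


def overhead_fraction(interval_s: float, checkpoint_write_s: float,
                      mtbf_s: float) -> float:
    """Approximate fraction of time lost to checkpoint writes plus rework.

    The first term is time spent writing checkpoints; the second is the
    expected recomputation after a failure (half an interval on average).
    """
    return checkpoint_write_s / interval_s + interval_s / (2 * mtbf_s)


write_s = 30.0          # assumed: time to write one checkpoint
mtbf_s = 6 * 3600.0     # assumed: one job-level failure every 6 hours

best = young_interval_s(write_s, mtbf_s)
print(f"checkpoint every {best / 60:.1f} min, "
      f"overhead {overhead_fraction(best, write_s, mtbf_s):.1%}")
print(f"hourly checkpoints: overhead "
      f"{overhead_fraction(3600.0, write_s, mtbf_s):.1%}")
```

The approximation ignores restart time and assumes failures arrive independently, so treat the result as a sanity check rather than a tuned value.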
Architectural Decision Matrix
Clos Fat-Tree vs. 3D Torus (HPC-style)
Advantages:
- Uniform bisection bandwidth: any node can talk to any other at line rate
- Fault tolerance via multiple parallel paths (ECMP)
- Horizontal scaling without topology redesign
Trade-offs:
- Cabling and switch cost grows O(n log n), more expensive than torus for very large clusters
- Higher hop-count latency than direct torus for nearest-neighbor communication
Verdict: Correct for multi-tenant cloud. Torus only makes sense in dedicated pure-HPC clusters.
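To put numbers on the scaling cost, the sketch below sizes a textbook three-tier k-ary fat-tree built from identical k-port switches: k pods, k/2 edge and k/2 aggregation switches per pod, (k/2)^2 core switches, and k^3/4 host ports. This is the standard academic construction, not a description of AWS's actual fabric, and "host" here means one NIC port.

```python
def fat_tree_size(k: int) -> dict:
    """Component counts for a three-tier fat-tree of k-port switches."""
    if k < 2 or k % 2:
        raise ValueError("k must be an even number >= 2")
    half = k // 2
    return {
        "hosts": k * half * half,                 # k^3 / 4
        "edge_switches": k * half,                # k/2 per pod
        "aggregation_switches": k * half,         # k/2 per pod
        "core_switches": half * half,             # (k/2)^2
        "switch_to_switch_links": 2 * k * half * half,
    }


def min_radix_for(hosts: int) -> int:
    """Smallest even switch radix whose fat-tree fits the given host count."""
    k = 2
    while k ** 3 // 4 < hosts:
        k += 2
    return k


k = min_radix_for(50_000)
print(k, fat_tree_size(k))
```

The switch-to-switch link count equals twice the host count, which is where most of the cabling and transceiver cost comes from.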
DCQCN (ECN-based) vs. pure PFC
Advantages:
- Avoids PFC deadlock, the biggest operational risk in RoCEv2 networks
- Proactive, not reactive, congestion control
- Compatible with conventional TCP traffic on the same fabric
Trade-offs:
- Requires ECN support on all switches in the path, which constrains hardware choice
- Parameter tuning (Kmin, Kmax, g) is sensitive to the specific traffic profile
Verdict: Correct decision. Pure PFC is a time bomb in production with AI incast.
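The sensitivity to Kmin and Kmax is easier to see with the switch-side marking rule written out. DCQCN uses a RED-style profile: no marking below Kmin, a linear ramp up to Pmax at Kmax, and marking every packet above Kmax. The thresholds in the example are invented for illustration and are not recommended settings.

```python
def ecn_mark_probability(queue_kb: float, kmin_kb: float,
                         kmax_kb: float, pmax: float) -> float:
    """Probability that a packet is ECN-marked at a given egress queue depth."""
    if not (0 <= kmin_kb < kmax_kb and 0 <= pmax <= 1):
        raise ValueError("need 0 <= Kmin < Kmax and 0 <= Pmax <= 1")
    if queue_kb <= kmin_kb:
        return 0.0
    if queue_kb > kmax_kb:
        return 1.0
    return pmax * ((queue_kb - kmin_kb) / (kmax_kb - kmin_kb))


# Hypothetical thresholds: Kmin = 100 KB, Kmax = 400 KB, Pmax = 20%.
for depth in (50, 250, 400, 500):
    print(depth, ecn_mark_probability(depth, 100, 400, 0.2))
```

Set Kmin too low and senders back off on harmless bursts; set it too high and the queue reaches the PFC threshold before ECN has slowed anyone down.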
Custom silicon (Nitro, Trainium) vs. merchant silicon (Broadcom Tomahawk)
Advantages:
- Full control of feature roadmap: can optimize for specific AI traffic patterns
- Deep integration between NIC, switch, and accelerator (collective offload)
- Sustainable competitive advantage, since competitors do not have access to the same silicon
Trade-offs:
- Silicon R&D CapEx is in the billions, a prohibitive entry barrier for anyone except hyperscalers
- Hardware iteration cycle of 2-3 years vs. software that can be updated in weeks
Verdict: Makes sense for AWS. For any other company, merchant silicon with differentiated software is the path.
Centralized SDN controller vs. pure distributed routing
Advantages:
- Global network view: can optimize routes with complete state information
- Faster convergence on complex failures (multiple simultaneous links)
Trade-offs:
- The centralized controller is a logical single point of failure and requires careful HA
- Control latency (RTT to controller) can be problematic for sub-second re-routing in very large clusters
Verdict: Hybrid is the state of the art: distributed for fast local reactions, centralized for global optimization.
Sustainability as an Architectural Constraint, Not Marketing
Amazon reached its 100% renewable energy goal in 2023, ahead of the original 2030 target. This is relevant to this teardown not as an ESG footnote, but because sustainability is becoming a first-class design constraint in AI network architecture.
The argument is straightforward: a 10,000-accelerator AI training cluster consumes on the order of 30-60 MW. The interconnect network represents 10-20% of that consumption — meaning 3-12 MW in switches and transceivers alone. The choice between 112G PAM4 vs. 56G NRZ SerDes for 400G links has a direct impact on consumption per bit. The choice of deep buffers (necessary to absorb incast) vs. shallow buffers (more energy-efficient) is an explicit trade-off between network efficiency and energy consumption.
AWS is addressing this on three fronts. First, PUE of ~1.2 in new data centers, achieved via direct liquid cooling in GPU racks and airflow optimization. For context: a PUE of 1.2 means about 83% of facility energy reaches the IT equipment; a PUE of 1.5 (typical of legacy data centers) means only about 67%. For a facility drawing 50 MW in total, that difference is roughly 8 MW of additional power available to IT equipment, enough to power a small city.
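The PUE figures follow directly from the definition (PUE is total facility power divided by IT equipment power). A short sketch of the arithmetic, treating the 50 MW figure as total facility draw, which is an assumption about how the number is meant:

```python
def it_fraction(pue: float) -> float:
    """Share of facility power that reaches IT equipment."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return 1.0 / pue


def it_power_mw(facility_mw: float, pue: float) -> float:
    """IT power available from a given total facility draw."""
    return facility_mw * it_fraction(pue)


facility_mw = 50.0  # assumed to be total facility power, not IT load
modern = it_power_mw(facility_mw, 1.2)
legacy = it_power_mw(facility_mw, 1.5)
print(f"PUE 1.2: {it_fraction(1.2):.0%} to IT, {modern:.1f} MW")
print(f"PUE 1.5: {it_fraction(1.5):.0%} to IT, {legacy:.1f} MW")
print(f"difference: {modern - legacy:.1f} MW")
```

If the 50 MW were instead the IT load, the facility-level gap would be larger: 75 MW versus 60 MW of total draw.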
Second, dynamic power capping integrated with the network controller. When data center power demand approaches a limit, the controller can reduce clock frequencies of non-critical switches or consolidate traffic on fewer physical links (shutting down the rest). This is analogous to what processors do with DVFS, but applied to the network.
Third, co-location with renewable sources. AWS is building data centers near solar and wind farms to minimize transmission losses and ensure temporal matching of renewable energy — not just annual credits (RECs), but real hourly matching. This influences region location choices and, indirectly, inter-region network latency.
The point I want to make clear: for systems architects designing AI workloads, network energy efficiency is not the AWS infrastructure team's problem — it is a parameter that affects cost per trained token and therefore the economic viability of the product. Choosing instances with more efficient networking (Trainium2 vs. P4d, for example) is an architectural decision with measurable financial impact.
AWS Well-Architected Framework Read
Security
Strong, but with expanded attack surface. RoCEv2 runs over UDP/IP with kernel bypass, so the traditional host-firewall security model does not apply directly. AWS isolates training clusters in dedicated VPCs with placement groups and granular security groups. The Nitro System ensures the hypervisor has no access to training data. The residual risk is lateral movement within the same cluster, mitigated by per-job network segmentation.
Reliability
The architecture's strongest point. Multiple parallel paths (ECMP with 64+ paths in fat-tree), sub-second re-routing via graph controller, and automatic job checkpointing create layered resilience. A single spine switch failure should not degrade throughput by more than 1/N (where N is the number of spines). The main risk is correlated failure — a firmware bug affecting all switches of the same model simultaneously. AWS mitigates this with rolling firmware updates and network canary deployments.
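The 1/N bound is simple to illustrate. The sketch below assumes equal-capacity spines and hash-based ECMP; `zlib.crc32` stands in for the switch's hardware hash, and the flow tuple and spine names are made up. UDP destination port 4791 is the registered RoCEv2 port.

```python
import zlib


def pick_spine(flow: tuple, live_spines: list[str]) -> str:
    """Hash-based ECMP: a flow's 5-tuple deterministically selects one spine."""
    key = "|".join(map(str, flow)).encode()
    return live_spines[zlib.crc32(key) % len(live_spines)]


def remaining_capacity_fraction(total_spines: int, failed_spines: int) -> float:
    """Bisection capacity left after spine failures, assuming equal spines."""
    if total_spines <= 0 or not 0 <= failed_spines <= total_spines:
        raise ValueError("invalid spine counts")
    return (total_spines - failed_spines) / total_spines


spines = [f"S{i}" for i in range(16)]
flow = ("10.0.0.1", "10.0.1.7", 49152, 4791, "udp")

before = pick_spine(flow, spines)
survivors = [s for s in spines if s != before]   # the chosen spine fails
after = pick_spine(flow, survivors)

print(before, "->", after)
print(f"capacity left: {remaining_capacity_fraction(16, 1):.2%}")
```

The capacity loss is bounded, but only once the failed spine has been removed from the ECMP group; until then, flows hashed to it are blackholed, which is why detection time matters as much as path count.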
Performance efficiency
Optimized for the use case, with explicit trade-offs. RoCEv2 + DCQCN delivers tens-of-microseconds latency and near-line-rate throughput under normal conditions. The bottleneck in large clusters is the collective All-Reduce, specifically the reduce-scatter phase that saturates ToR uplinks. AWS addresses this with collective offload in Trainium hardware (EFA with custom NCCL), moving part of the All-Reduce computation to the NIC and reducing network traffic. The estimated result is a 20-40% reduction in training step time for large models (estimate based on public EFA vs. TCP benchmarks).
Sustainability
Real and measurable differentiator. PUE of ~1.2, 100% renewable energy matching (with a move toward hourly matching rather than annual RECs alone), and dynamic power capping integrated with the network controller place AWS among the most energy-efficient data center infrastructures in the world. The impact for the customer is direct: training a model on AWS has a materially lower carbon footprint than training on-premises in a legacy data center. This is becoming a procurement criterion for companies with net-zero targets.
What I Would Do Differently
1. I would bet earlier on non-Clos graph topologies for dedicated AI clusters. The fat-tree Clos is excellent for multi-tenancy and general traffic, but for dedicated training clusters (where you know exactly which jobs run where), topologies like Dragonfly+ or Slim Fly offer lower network diameter and lower cabling cost for the same bisection bandwidth. AWS is likely exploring this internally (there are signals in research papers), but public communication is still dominated by the Clos narrative. If I were designing a 50,000+ accelerator cluster dedicated to training, I would seriously evaluate Dragonfly+ with graph-based adaptive routing.
2. I would invest heavily in correlated observability before scaling. The hardest problem I see in production is not congestion; it is debugging. When a training job is 30% slower than expected, the cause could be: (a) congestion on a specific link, (b) a worker with a degraded GPU, (c) imbalance in model partitioning, (d) checkpoint throttling due to I/O. Without automatic correlation of network + GPU + OS metrics, you spend hours on hypotheses. I would build a unified observability pipeline (OpenTelemetry with job context propagated down to the network packet level) before scaling beyond 1,000 nodes.
3. I would treat the SDN controller as a mission-critical service from day 1. I see teams that deploy SDN controllers with the same discipline as a low-criticality internal service. For a production AI network, the controller is as critical as a payments database. This means: active-active HA with regularly tested failover, chaos engineering on the network (controlled link failure injection in production), and detailed runbooks for every controller failure mode. Most teams don't do this until after the first serious incident.
4. I would physically separate checkpoint traffic from All-Reduce traffic. Using the same high-speed network for All-Reduce (latency-critical) and checkpointing (throughput-critical, latency-tolerant) creates contention at checkpoint moments. I would design a separate storage network (25G or 100G) dedicated to checkpointing, freeing the 400G+ network exclusively for gradient traffic. The additional cabling cost is marginal compared to the impact of a checkpoint that degrades training throughput for 10-15 seconds every hour.
Impact for Systems Architects: What This Means in Practice
This teardown is not just about what AWS does internally. The architectural decisions of AWS's network infrastructure propagate directly to the choices systems architects make when designing AI workloads in the cloud.
Instance choice is not just about FLOP/$. The instance's network topology matters as much as the number of accelerators. A P4de instance with EFA in a cluster placement group has fundamentally different network behavior from a G5 instance in a generic AZ. For distributed training jobs above 8 GPUs, instance and placement group selection should be guided by analysis of the model's communication pattern — not just compute cost.
Model partitioning has network implications. Tensor parallelism (TP) generates high-frequency All-Reduce traffic between GPUs, typically on the same node, which is ideal for NVLink, not Ethernet. Pipeline parallelism (PP) generates point-to-point traffic between stages and tolerates higher latency. Data parallelism (DP) generates periodic All-Reduce, the case for which the network described in this teardown is optimized. An architect who understands network topology can choose the parallelism strategy that minimizes the communication bottleneck for their specific model.
Network monitoring is part of the training SLA. In financial systems, I would never accept an availability SLA without network latency monitoring. The same principle applies to AI clusters: the job completion SLA must include network metrics (link utilization, RDMA retransmission rate, All-Reduce latency). CloudWatch with EFA metrics and VPC Flow Logs are the minimum; per-flow, INT-style telemetry is the state of the art.
Network cost is visible and optimizable. Cross-AZ traffic has an explicit cost on AWS. A poorly partitioned training job that generates cross-AZ traffic can have network costs that exceed compute costs. This is an architecture bug, not an infrastructure problem. The solution is to use cluster placement groups (which keep instances in the same AZ and, ideally, close together in the network) and to analyze traffic before scaling.
Verdict
AWS's data center network architecture for AI is technically sound and represents the state of the art in public cloud infrastructure. The combination of fat-tree Clos topology with a graph-based control plane, RoCEv2 with DCQCN, per-flow in-band telemetry, and deeply integrated custom silicon is coherent, well-grounded in academic research, and addresses the real problems of congestion, resilience, and energy efficiency that define the cost of AI training at scale.
The strengths are clear: resilience by design (multiple paths, sub-second re-routing, automatic checkpointing), measurable energy efficiency (PUE ~1.2, 100% renewable), and deep network-accelerator integration (EFA, collective offload in Trainium). These are not marketing differentiators; they are architectural advantages with direct impact on cost per trained token.
The gaps I identify are primarily in observability and operations: tooling to correlate network metrics with training job metrics is still maturing, and the operational complexity of DCQCN tuning and RoCEv2 configuration is underestimated by most teams adopting these technologies. The centralized SDN controller, if not treated with mission-critical system discipline, is a real risk.
For systems architects designing AI workloads: AWS's network infrastructure is good enough for most use cases. The competitive differentiator is not choosing the cloud provider with the best network; it is understanding the topology deeply enough to make model partitioning, placement, and monitoring choices that extract the maximum from that infrastructure. That understanding is rare, and that is where real architectural value lies.