Who is Fernando F. Azevedo?

Fernando F. Azevedo is a Senior Solutions Architect at Banco Itaú with 16+ years of experience across AWS, event-driven architecture, DevSecOps, Data Mesh, AI and financial systems.

What technical topics does Fernando work with?

Fernando works with AWS, Kubernetes, Kafka, Data Mesh, Amazon Bedrock, RAG, DevSecOps, observability, financial systems and architecture communication using C4, ADRs and trade-off analysis.

Is Fernando available for professional conversations?

Fernando is currently building at Banco Itaú and is open to thoughtful conversations about architecture, cloud, AI, engineering leadership, community, podcasts and technical collaboration.

Post-mortem · Microsoft Azure · Resilience

Azure (2026): GenAI Workload Overload and the Shared Blast Radius

May 29, 2026 · 10 min · AI-assisted

On May 29, 2026, Microsoft Azure experienced an availability incident tied to shared routing infrastructure saturated by first-party generative AI workloads. The event exposed the classic noisy neighbor risk at hyperscale and led Microsoft to migrate its own GenAI loads to dedicated routing planes. This analysis reconstructs the incident, evaluates the architectural decisions involved, and extracts lessons applicable to any platform hosting LLM inference on multi-tenant infrastructure.

When Microsoft placed its own generative AI workloads on the same routing infrastructure that serves external customers, it built a capacity bomb with a timer. On May 29, 2026, that timer hit zero — and the blast radius reached customers who had nothing to do with AI.

Incident Facts

Company / System: Microsoft Azure — global routing infrastructure
Incident Date: May 29, 2026
Primary Category: Capacity saturation / Noisy neighbor on shared routing
Affected Services: Multiple Azure services dependent on the shared routing layer; external customers indirectly impacted
Root Cause (summary): First-party (Microsoft) GenAI workloads consumed disproportionate capacity on shared routing infrastructure, degrading external customer services
Structural Response: Migration of large first-party GenAI loads to dedicated routing infrastructure (separate plane)
Primary Source: Azure Status History — microsoft.com
Severity Classification (estimated): High — broad availability degradation across multiple services

What Happened

On May 29, 2026, Microsoft Azure recorded an availability incident whose primary vector was the saturation of the shared routing layer by generative AI workloads operated by Microsoft itself — first-party loads, not external customers.

The dynamic is well-known in distributed systems but rarely manifests at this scale: a privileged tenant — in this case, the cloud operator itself — consumes shared infrastructure resources disproportionately, degrading service quality for all other tenants. The technical term is noisy neighbor, but here the 'noisy neighbor' was the building owner.

LLM inference workloads have a radically different traffic profile from traditional workloads. A single inference request can last from hundreds of milliseconds to several seconds, holding connections open, consuming routing buffers, and occupying processing slots for far longer than a conventional API call. When these workloads scale horizontally, as Microsoft's Copilot-based products and Azure OpenAI Service inevitably do, the pressure on the routing layer compounds: the number of connections held open at any moment is roughly the request rate multiplied by the request duration, so growth in either one multiplies the effect of the other.

Azure's routing infrastructure, designed for the traffic profile of traditional cloud services (short requests, high frequency, low connection latency), was not sized to absorb this new pattern without affecting the shared data plane. The result was availability degradation in services with no direct relationship to AI: a blast radius that expanded laterally through the shared infrastructure.
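To make the difference in traffic profiles concrete, here is a minimal sketch based on Little's Law (mean requests in flight = arrival rate × mean time in system). The arrival rate and the two durations are hypothetical figures chosen for illustration; they are not Azure measurements, and the function name is my own.

```python
def concurrent_connections(requests_per_second: float, mean_duration_s: float) -> float:
    """Little's Law: mean in-flight requests = arrival rate * mean duration."""
    if requests_per_second < 0 or mean_duration_s < 0:
        raise ValueError("rate and duration must be non-negative")
    return requests_per_second * mean_duration_s


# Hypothetical figures for illustration only, not Azure measurements.
RATE_RPS = 10_000      # same arrival rate for both traffic profiles
API_CALL_S = 0.05      # a 50 ms conventional API call
INFERENCE_S = 10.0     # a 10 s streamed LLM response

api_in_flight = concurrent_connections(RATE_RPS, API_CALL_S)
llm_in_flight = concurrent_connections(RATE_RPS, INFERENCE_S)

print(f"API calls in flight:    {api_in_flight:,.0f}")
print(f"LLM requests in flight: {llm_in_flight:,.0f}")
print(f"Connection-slot ratio:  {llm_in_flight / api_in_flight:,.0f}x")
```

At the same request rate, the longer-lived requests occupy proportionally more connection slots, which is why an RPS-based alert threshold can stay quiet while the routing layer fills up.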

Incident Timeline

1. Pre-incident: Accumulated Growth
In the months preceding May 2026, Microsoft's GenAI-based products (Copilot for Microsoft 365, Azure OpenAI Service, Bing AI, and others) experienced accelerated usage growth. These inference workloads were gradually scaled on Azure's shared routing infrastructure, incrementally increasing capacity pressure while staying below alert thresholds configured for traditional traffic patterns.

2. May 29, 2026: Trigger Event
A demand spike in first-party GenAI workloads, possibly correlated with a feature launch, marketing campaign, or organic usage event, pushed shared routing layer utilization past the saturation point. Existing throttling mechanisms were insufficient to contain the impact within the GenAI domain.

3. Propagation: Lateral Blast Radius
Routing layer saturation began affecting non-AI Azure services. External customers reported availability degradation in services dependent on the same routing plane. The impact manifested as increased latency, timeouts, and intermittent failures, the typical symptoms of resource contention in shared infrastructure.

4. Immediate Response: Operational Mitigation
Azure engineering teams applied immediate operational mitigations: aggressive throttling on first-party GenAI workloads, load redistribution across regions, and activation of reserve capacity. Degradation was gradually contained, but recovery time was significant given the scope of the impact.

5. Post-incident: Structural Decision
Microsoft made the architectural decision to migrate large first-party GenAI loads to dedicated routing infrastructure: a separate plane, isolated from the shared plane serving external customers. This change was publicly communicated as part of the incident response and represents a fundamental change in Azure's tenant isolation model for AI workloads.

Failure Flow: Shared vs. Dedicated Routing

The diagram reconstructs the PRE-incident state (shared plane, left) and the POST-remediation state (dedicated plane for GenAI, right). The dashed red arrow indicates the blast radius propagation vector in the prior state.

[Diagram: shared vs. dedicated routing]

Traffic sources:
External Customers: Azure customers
First-Party GenAI Workloads: Microsoft Copilot (M365 / Bing AI), Azure OpenAI Service (internal)

⚠️ PRE-incident, shared plane: Shared Routing Plane, Shared Load Balancer

Azure Services behind the shared plane: Azure Service A (e.g., Storage), Azure Service B (e.g., Compute), Azure Service C (e.g., Databases)

✅ POST-remediation, dedicated plane: Dedicated GenAI Routing Plane, AI-specific Quota & Throttling Engine, LLM Inference Fleet (GPU clusters)

Root Cause: The Operator as Noisy Neighbor

The root cause was not a bug, hardware failure, or external attack. It was an implicit design decision: placing extremely resource-intensive first-party workloads on the same routing plane that serves external customers, without adequate capacity isolation between these two domains.

LLM inference workloads have three characteristics that make them especially dangerous in shared infrastructure:

(1) Long request latency: they hold resources allocated for seconds, not milliseconds.
(2) Asymmetric elasticity: they scale rapidly in response to demand, but routing infrastructure does not scale at the same speed.
(3) Implicit priority: first-party loads frequently have privileged access or less throttling than external customers, creating a structural imbalance.

The result was a blast radius that propagated laterally: services with no relationship to AI were degraded because they shared the same routing layer with workloads that saturated it.

Why This Is Different from Previous Capacity Incidents

Capacity incidents in the cloud are not new. AWS, Azure, and GCP have all suffered degradations from shared resource saturation. But the May 2026 incident has characteristics that structurally distinguish it from precedents.

First, the load profile is qualitatively different. Traditional cloud workloads (VMs, containers, serverless functions, databases) have short-duration, high-frequency requests. Routing infrastructure was designed and sized for this pattern. LLM inference breaks this fundamental assumption: a single text generation request can hold a connection open for 5 to 30 seconds while the model generates tokens. Multiply that by millions of simultaneous Copilot users and you have a traffic pattern that routing infrastructure simply was not designed to absorb.

Second, the growth scale was unprecedented. Adoption of Microsoft's GenAI products grew exponentially between 2023 and 2026. This growth was neither linear nor predictable in the same terms as traditional workload growth. Usage spikes correlated with external events (product launches, viral news, marketing campaigns) created load patterns that traditional capacity planning models did not adequately capture.

Third, tenant isolation was not designed for this scenario. Azure's multi-tenant architecture was built assuming that no individual tenant — including Microsoft itself — would consume a disproportionate fraction of shared routing capacity. This assumption was reasonable for traditional workloads. For GenAI workloads at mass consumer product scale, it proved incorrect.

This combination — new load profile, exponential growth, inadequate isolation — created the conditions for an incident that would not have been predicted by existing risk models. It is a reminder that architectural assumptions have expiration dates, especially when the load profile changes fundamentally.

Remediation: Structural Isolation of Routing Planes

Microsoft's response to the incident was surgical and structurally correct: separating the routing plane for first-party GenAI workloads from the shared plane serving external customers. This decision deserves detailed analysis because it is exactly the type of remediation that resolves the root cause, not just the symptoms.

What plane separation resolves:

By moving its own GenAI loads to dedicated routing infrastructure, Microsoft eliminates the lateral blast radius vector. An inference spike in Copilot can no longer saturate the routing that serves a customer running database or compute workloads. The two domains become independent in terms of routing capacity.

Additionally, separation allows each plane to be sized and optimized for its specific load profile. The GenAI plane can be configured to handle long connections, high token concurrency, and burst traffic patterns characteristic of inference. The external customer plane can maintain optimizations for traditional workloads.
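The effect of separating the planes can be sketched with a toy capacity model. This is an illustration under a simplifying assumption, not a description of Azure's routing: I assume that a saturated shared pool squeezes every traffic class in proportion to its demand. The function names, slot counts, and demand figures are all hypothetical.

```python
def served_shared(demand: dict[str, int], slots: int) -> dict[str, float]:
    """Shared pool: under saturation every class is squeezed proportionally."""
    total = sum(demand.values())
    if total <= slots:
        return {name: float(d) for name, d in demand.items()}
    return {name: slots * d / total for name, d in demand.items()}


def served_dedicated(demand: dict[str, int], slots: dict[str, int]) -> dict[str, float]:
    """Dedicated planes: each class is capped only by its own plane."""
    return {name: float(min(d, slots[name])) for name, d in demand.items()}


# Hypothetical spike: GenAI demand far exceeds what the pool can hold.
demand = {"genai": 90_000, "customers": 10_000}

print(served_shared(demand, slots=50_000))
print(served_dedicated(demand, slots={"genai": 40_000, "customers": 10_000}))
```

With the same total of 50,000 slots, the shared pool serves only half of the customer demand during the spike, while the dedicated split serves all of it and confines the shortfall to the GenAI plane.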

What plane separation does not resolve alone:

Plane separation is necessary but not sufficient. External customers using Azure OpenAI Service directly still compete for inference capacity among themselves: the noisy neighbor problem shifts to the GPU/accelerator layer, it does not disappear. Complete remediation also requires: (1) per-tenant quotas at the inference layer, not just routing; (2) adaptive throttling mechanisms that respond to burst patterns; (3) capacity planning specific to LLM load profiles, including tokens per second as a primary capacity metric.
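A per-tenant quota measured in tokens per second can be sketched as a token bucket, one bucket per tenant. This is a minimal illustration of the general technique, not Azure's quota engine; the class names, rates, and tenant identifiers are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBudget:
    """Token bucket measured in LLM tokens rather than requests."""
    rate_tokens_per_s: float            # sustained refill rate
    burst_tokens: float                 # bucket capacity (largest allowed burst)
    last_ts: float = 0.0                # time of the last refill
    level: float = field(init=False)    # tokens currently available

    def __post_init__(self) -> None:
        self.level = self.burst_tokens  # every bucket starts full

    def try_consume(self, tokens: int, now: float) -> bool:
        """Refill for the elapsed time, then admit the request if it fits."""
        elapsed = max(0.0, now - self.last_ts)
        self.level = min(self.burst_tokens, self.level + elapsed * self.rate_tokens_per_s)
        self.last_ts = now
        if tokens <= self.level:
            self.level -= tokens
            return True
        return False


class TenantQuotas:
    """One independent bucket per tenant, so one tenant cannot drain another."""

    def __init__(self, rate_tokens_per_s: float, burst_tokens: float) -> None:
        self._rate = rate_tokens_per_s
        self._burst = burst_tokens
        self._buckets: dict[str, TokenBudget] = {}

    def admit(self, tenant: str, tokens: int, now: float) -> bool:
        if tenant not in self._buckets:
            self._buckets[tenant] = TokenBudget(self._rate, self._burst, last_ts=now)
        return self._buckets[tenant].try_consume(tokens, now)


quotas = TenantQuotas(rate_tokens_per_s=1_000, burst_tokens=4_000)
print(quotas.admit("tenant-a", 4_000, now=0.0))  # tenant-a spends its whole burst
print(quotas.admit("tenant-a", 500, now=0.1))    # too soon to have refilled 500 tokens
print(quotas.admit("tenant-b", 500, now=0.1))    # tenant-b has its own bucket
```

Charging by tokens rather than by request means a single long generation costs more quota than a short one, which matches the capacity metric the text recommends.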

Industry implications:

Microsoft's decision signals a paradigm shift that every cloud provider with AI ambitions will need to make: treating LLM inference workloads as a distinct resource class, with their own routing planes, capacity models, and isolation policies. Cloud infrastructure built for the pre-GenAI era is not adequate, without modification, for the era of massive inference.

Technical Lessons from the Incident

LLM inference is a new workload class: The profile of long requests, high concurrency, and asymmetric burst breaks the sizing assumptions of traditional routing infrastructure. Treat it as a separate class from design time.

The operator as privileged tenant is an architectural risk: When the platform operator hosts its own intensive workloads on the same infrastructure serving customers, it becomes the largest noisy neighbor risk. Plane isolation is mandatory, not optional.

Lateral blast radius is the most underestimated risk in multi-tenancy: The impact was not confined to AI services — it propagated to services unrelated to AI. Map shared infrastructure dependencies and calculate the potential blast radius for each domain.

Capacity planning needs AI-native metrics: Tokens per second, average inference request duration, and session concurrency are the correct metrics for sizing GenAI infrastructure — not traditional RPS or byte throughput.

Throttling must be applied at the inference layer, not just routing: Separating routing planes resolves network blast radius, but not GPU/accelerator blast radius. Per-tenant quotas at the inference layer are complementary and equally necessary.

Exponential growth invalidates linear capacity planning models: Mass-market GenAI products grow non-linearly and correlated with external events. Capacity planning models must incorporate extreme peak scenarios, not just historical averages.
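The advice to map shared dependencies and calculate blast radius can be sketched as a graph traversal: invert the dependency edges and collect everything that transitively depends on the saturated component. The dependency maps below are hypothetical and loosely follow the diagram in this study; they are not Azure's real topology.

```python
from collections import deque

# Hypothetical pre-incident map: component -> components it depends on.
DEPENDS_ON = {
    "copilot": ["shared-routing", "inference-fleet"],
    "azure-openai": ["shared-routing", "inference-fleet"],
    "storage": ["shared-routing"],
    "compute": ["shared-routing"],
    "databases": ["shared-routing", "storage"],
    "inference-fleet": [],
    "shared-routing": [],
}

# Hypothetical post-remediation map: GenAI moves to its own routing plane.
ISOLATED = {
    **DEPENDS_ON,
    "copilot": ["genai-routing", "inference-fleet"],
    "azure-openai": ["genai-routing", "inference-fleet"],
    "genai-routing": [],
}


def blast_radius(depends_on: dict[str, list[str]], failed: str) -> set[str]:
    """Return every component that transitively depends on `failed`."""
    # Invert the edges: component -> components that depend on it.
    dependents: dict[str, set[str]] = {name: set() for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(name)

    affected: set[str] = set()
    queue = deque([failed])
    while queue:  # breadth-first walk over the inverted graph
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


print(sorted(blast_radius(DEPENDS_ON, "shared-routing")))
print(sorted(blast_radius(ISOLATED, "genai-routing")))
```

In the first map, saturating the shared routing plane reaches every service; in the second, saturating the GenAI plane reaches only the GenAI products. Running this kind of check whenever a new workload type is introduced is one way to make the blast radius explicit.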

Architectural Decision: Routing Plane Separation

Accepted

Context

After the May 2026 incident, Microsoft needed to decide how to isolate first-party GenAI workloads from the routing plane shared with external customers. Options included more aggressive throttling on the shared plane, per-workload quotas, or physical plane separation.

Decision

Migrate large first-party GenAI loads to dedicated routing infrastructure, physically separated from the shared plane serving external customers.

Consequences

✅ Eliminates the lateral blast radius vector between GenAI and other Azure services
✅ Allows independent optimization of each plane for its specific load profile
✅ Improves capacity predictability for external customers
⚠️ Increases operational and infrastructure cost (plane duplication)
⚠️ Does not resolve noisy neighbor at the inference layer (GPU/accelerator) among external customers
⚠️ Requires complex migration of existing workloads without service interruption

Isolation Strategies for GenAI Workloads in Multi-Tenant Environments

Strategy	Routing Blast Radius	Inference Blast Radius	Operational Cost	Implementation Complexity
Shared Plane (pre-incident state)	High	High	Low	Low
Per-Workload Throttling on Shared Plane	Medium	High	Low	Medium
Dedicated Routing Plane (post-incident decision)	Eliminated	High	High	High
Dedicated Plane + Per-Tenant Inference Quotas	Eliminated	Low	Very High	Very High

My Senior Take: The Problem Nobody Wants to Admit

Senior Solutions Architect

What catches my attention in this incident is not the technical failure; it is the architectural governance failure that preceded it. Microsoft has world-class engineers. The noisy neighbor problem in shared infrastructure has been known for decades. Tenant isolation is a fundamental cloud design principle. And yet, first-party GenAI workloads, with a radically different load profile and exponential growth, were placed on the same routing plane serving external customers.

This happens for a reason I see repeatedly in large organizations: product launch velocity outpaces infrastructure adaptation velocity. When Copilot needed to scale quickly, the fastest decision was to use existing infrastructure. Technical debt was silently accumulated until the day it collected its price.

What I would do differently: In any platform hosting LLM inference at scale, I would treat inference workloads as a first-class resource class from day zero, with their own routing planes, capacity models based on tokens/second, per-tenant quotas at the GPU layer, and circuit breakers that isolate failure domains before blast radius propagates. Not as a future optimization, but as a non-negotiable architectural requirement.

Additionally, I would implement an explicit 'blast radius budget' for each service domain: a formal analysis of which other services can be affected if that domain saturates its shared infrastructure. This budget would be reviewed at every significant architectural change, especially when new workload types are introduced.

The most important lesson from this incident is not technical. It is organizational: capacity governance needs to carry the same weight as product delivery velocity. In technology companies, this is a difficult political battle. But incidents like this are the cost of losing that battle.

Well-Architected Framework Analysis

Security

Tenant isolation risk. Although not a security incident, the absence of capacity isolation between tenants (including the operator itself) is a tenant isolation risk with security implications in adversarial scenarios.

Reliability

Critical failure. The absence of failure domain isolation between GenAI workloads and customer services violated the blast radius containment principle. The remediation — routing plane separation — is the correct response, but should have been the initial design.

Performance efficiency

Invalid performance assumptions. The routing infrastructure performance model was built for short-duration workloads. LLM inference has a fundamentally different performance profile. Capacity planning and benchmarking need to be redone with AI-native metrics.

Verdict: The Price of Treating AI as Just Another Workload

The Azure incident of May 2026 is a case study in the cost of outdated architectural assumptions in an environment of accelerated technological change. The decision to host first-party GenAI workloads on the same routing plane serving external customers was not negligence. It was a pragmatic decision that made sense at a given moment, when scale was smaller and the load profile was not yet well understood. The problem is that this decision was not revisited as scale grew exponentially.

What this incident proves: Cloud infrastructure built for traditional workloads needs to be fundamentally rethought to support LLM inference at mass consumer product scale. It is not a matter of adding more capacity to the existing plane; it is a matter of recognizing that LLM inference is a distinct resource class requiring first-class architectural isolation.

The remediation was correct: Separating routing planes is the right decision. But it is only the first step. The noisy neighbor problem shifts to the inference layer (GPU/accelerator), and the complete solution requires per-tenant quotas, adaptive throttling, and capacity planning with AI-native metrics at all stack layers.

The lesson for the industry: Every cloud provider, every company building GenAI products on shared infrastructure, and every architect responsible for multi-tenant systems with AI workloads needs to ask the same question Microsoft was forced to answer: was my isolation infrastructure designed for the load profile I am running today, or for the load profile of five years ago? If the answer is the second option, the timer is already running.

References

Azure Status History — Microsoft Azure

#azure #genai #resiliencia #noisy-neighbor #capacity-planning #incident #blast-radius #inferencia

Written with AI assistance from the public case and my architect's reading.

