Who is Fernando F. Azevedo?

Fernando F. Azevedo is a Senior Solutions Architect at Banco Itaú with 16+ years of experience across AWS, event-driven architecture, DevSecOps, Data Mesh, AI and financial systems.



Atlassian 2022: when a maintenance script deletes 400 customers for two weeks

Apr 5, 2022 9 min AI-assisted


In April 2022, a poorly parameterized maintenance script executed hard deletes on ~400 Atlassian Cloud customer sites, including Jira, Confluence, and other products. The absence of soft-delete, lack of mandatory dry-run, and nonexistent mass-restore tooling turned a minutes-long operational mistake into up to 14 days of downtime. This post-mortem examines the failure chain, the real blast radius, and the structural lessons every platform team should internalize.

Incident Facts

Company: Atlassian
Start date: April 5, 2022
Total recovery duration: Up to 14 days for the most affected customers
Impacted customers: ~400 sites (approx. 0.18% of the cloud customer base)
Affected products: Jira Software, Jira Service Management, Jira Work Management, Confluence, Atlassian Access, Statuspage
Failure type: Accidental hard delete via maintenance script with incorrect IDs
Primary root cause: The script was meant to remove legacy app instances by ID, but the IDs it was given were applied to active production sites, which were then deleted
Relevant stack: Atlassian Cloud (multi-tenant SaaS), internal provisioning and deprovisioning systems
Official source: Post-Incident Review published by Atlassian Engineering

Operational mistakes happen. What separates a contained incident from a two-week catastrophe is not the absence of the mistake but the absence of the defense layers that should have existed before, during, and after it. The Atlassian incident of April 2022 is a textbook example of what not to do when operating a multi-tenant SaaS platform at scale: a script without mandatory dry-run, hard deletes without soft-delete as a safety net, and the late discovery that restoring 400 sites individually, without adequate tooling, would take weeks. This post-mortem is not about blame. It is about architecture.

What happened

Atlassian operates a continuous data cleanup process to remove instances of legacy applications that have been discontinued or disconnected by customers. On April 5, 2022, the responsible team executed a maintenance script with the goal of deprovisioning a specific set of instances of a legacy app — Atlassian Insight (later integrated into Jira Service Management). The script was part of a broader product consolidation process.

The core problem was one of parameterization: the script received as input a list of legacy app instance IDs to be removed, but the execution mechanism interpreted those IDs in the wrong context — applying them as identifiers of active customer sites on the Atlassian Cloud platform. The result was the execution of hard deletes on approximately 400 production sites, wiping data across multiple products per site: Jira Software, Jira Service Management, Confluence, Atlassian Access, and others.

The term "hard delete" here is precise and critical: this was not a logical marking for future removal, nor a soft-delete reversible by administrative operation. The data was permanently deleted from primary storage systems. There was no immediate rollback mechanism. Recovery would depend entirely on backups — and restoring backups, per site, manually, for hundreds of customers, proved to be the true bottleneck of the incident.
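For contrast, a reversible delete path can be sketched in a few lines. This is a minimal illustration, not Atlassian's implementation: the `Site` class, its method names, and the 72-hour `RETENTION` value are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class State(Enum):
    ACTIVE = "ACTIVE"
    PENDING_DELETION = "PENDING_DELETION"
    PURGED = "PURGED"


RETENTION = timedelta(hours=72)  # assumed retention window for this sketch


@dataclass
class Site:
    site_id: str
    state: State = State.ACTIVE
    purge_after: Optional[datetime] = None

    def soft_delete(self, now: datetime) -> None:
        """Mark the site for deletion; its data is untouched until the window expires."""
        self.state = State.PENDING_DELETION
        self.purge_after = now + RETENTION

    def restore(self) -> None:
        """Undo a soft delete. Only possible before the purge has run."""
        if self.state is not State.PENDING_DELETION:
            raise ValueError(f"cannot restore a site in state {self.state.value}")
        self.state = State.ACTIVE
        self.purge_after = None

    def purge_if_due(self, now: datetime) -> bool:
        """Hard-delete only once the retention window has fully elapsed."""
        if self.state is State.PENDING_DELETION and now >= self.purge_after:
            self.state = State.PURGED
            return True
        return False


t0 = datetime(2022, 4, 5, tzinfo=timezone.utc)  # arbitrary timestamp
site = Site("site-7")
site.soft_delete(t0)
purged_early = site.purge_if_due(t0 + timedelta(hours=1))  # still inside the window
site.restore()  # the mistake is noticed and reversed with one call
```

With this shape, a wrong deletion costs a state flip rather than a backup restore, as long as someone notices before the window closes.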

Timeline

1. Apr 5: Script executed
Maintenance team executes deprovisioning script for the legacy Insight app. The script receives incorrect IDs and executes hard deletes on ~400 active customer sites. Execution occurs without a prior dry-run in production.
2. Apr 5: Initial detection
Customers begin reporting total unavailability of their sites. Internal alerts fire. The engineering team identifies that entire sites have been removed, not just specific app instances.
3. Apr 5–6: Damage scope assessed
Atlassian confirms ~400 affected sites. It is discovered that no automated mass-restore process exists. Each site must be restored individually from backups, requiring intensive manual engineering work.
4. Apr 6–10: Public communication and triage
Atlassian publishes updates on the status page and begins directly contacting affected customers. The company prioritizes customers with the highest impact (larger user counts, enterprise contracts). Public criticism grows due to the lack of a clear ETA.
5. Apr 10–18: Phased restoration
Engineering assembles a dedicated task force. The restore process is partially automated over the following days, but still requires manual validation per site. Most customers are restored between April 10 and 18.
6. Apr 18: Last customer restored
Approximately 14 days after the incident began, the last affected site is restored. Atlassian confirms no data was permanently lost: all data was recovered from backups, but downtime ranged from hours to two weeks.
7. Apr 2022: Post-Incident Review published
Atlassian publishes its official post-incident review, detailing causes, timeline, and planned corrective actions. The document is notably transparent by industry standards.

Failure Flow: How the Script Deleted Production Sites

The diagram reconstructs the execution flow of the maintenance script and how the ID confusion propagated hard deletes to the storage systems of multiple products. Dashed arrows indicate the path that should have existed (dry-run, soft-delete) but did not.

Maintenance Operation
Maintenance engineer
Deprovisioning script
ID list (legacy Insight)

Provisioning Platform
Provisioning API
Site registry (production IDs)
Deletion engine

Data Storage
Jira Software + JSM data
Confluence data
Atlassian Access + others
Backups (manual restore)

Missing Controls
Dry-run (did not exist)
Soft-delete (not implemented)
Mass restore (did not exist)

Root Cause: Three Absences, Not One Mistake

The immediate root cause was ID namespace confusion: the script treated instance IDs from a legacy app as production site IDs, and the deletion engine did not validate the context. But the true systemic root cause is the absence of three independent defense layers:

1. No mandatory dry-run: The script was executed directly in production without a simulation phase that would list what would be deleted without actually deleting it. A dry-run would have exposed the error before any damage.
2. No soft-delete: The deprovisioning system executed hard deletes directly. A soft-delete pattern, marking the resource as PENDING_DELETION with a 24–72h retention window, would have allowed immediate reversal after error detection.
3. No mass-restore tooling: Atlassian had backups. The problem was not the absence of recoverable data but the absence of infrastructure to restore hundreds of sites in parallel. The manual process was the time multiplier that turned an hours-long incident into a weeks-long one.
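The dry-run layer is the easiest of the three to picture in code. The sketch below is a hypothetical shape for a script where dry-run is the default and real execution requires the previously reviewed plan. `Plan`, `deprovision`, and the in-memory `store` are illustrative names, not anything from Atlassian's tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Plan:
    """What a run would delete, computed without touching any data."""
    targets: tuple[tuple[str, str], ...]  # (id, human-readable description)


def deprovision(ids: list[str], store: dict[str, str],
                approved: Optional[Plan] = None) -> Plan:
    """Dry-run unless an approved plan is passed in and still matches."""
    plan = Plan(tuple(sorted((i, store[i]) for i in ids if i in store)))
    if approved is None:
        return plan  # default path: report only, delete nothing
    if approved != plan:
        raise RuntimeError("approved plan no longer matches; repeat the dry-run")
    for entity_id, _ in plan.targets:
        del store[entity_id]
    return plan


store = {
    "site-7": "customer site: Example Corp",
    "app-101": "legacy app instance",
}

# The reviewer sees "customer site" in the preview before anything is deleted.
preview = deprovision(["site-7"], store)
```

Passing `approved=preview` on a second call performs the deletion. Because the plan carries descriptions and not only IDs, a list that resolves to customer sites instead of app instances is visible to the human reviewing it.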

The Real Blast Radius and Why It Took Two Weeks

It is tempting to reduce this incident to "0.18% of customers" and move on. But blast radius must be measured in terms of impact per affected customer, not just as a proportion of the total base.

Each of the ~400 sites was an organization with multiple users, active projects, ticket history, Confluence documentation, approval workflows, and integrations. For those organizations, the outage was total — not degraded, not partial. Jira inaccessible means engineering teams cannot track bugs, support teams cannot open tickets, releases are blocked. Two weeks in that state for a mid-sized company is measurable operational damage.

What prolonged recovery was the absence of automation in the restore path. Atlassian maintained regular backups — this was critical in ensuring that no data was permanently lost. But the restore process was essentially: identify the correct backup for that specific site, provision a restore environment, execute the restore, validate data integrity, reconnect integrations, and notify the customer. Multiplied by 400, without automated parallelism, this becomes a weeks-long operation even with a dedicated task force.
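The effect of parallelism on recovery time is simple arithmetic, and the runner that provides it is not complicated either. The sketch below uses made-up figures (4 hours per site, 5 versus 50 concurrent restores) purely to show the shape of the calculation; they are not numbers from Atlassian's review. `restore_all` and `fake_restore` are hypothetical helpers.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def estimate_hours(sites: int, hours_per_site: float, parallelism: int) -> float:
    """Wall-clock estimate when restores run in waves of `parallelism`."""
    return math.ceil(sites / parallelism) * hours_per_site


def restore_all(site_ids, restore_one, max_workers):
    """Run restore_one for every site concurrently; collect failures instead of stopping."""
    def attempt(site_id):
        try:
            restore_one(site_id)
            return site_id, None
        except Exception as exc:  # keep going; a failed site can be retried later
            return site_id, str(exc)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(attempt, site_ids))


def fake_restore(site_id: str) -> None:
    # Stand-in for the real pipeline: pick backup, restore, validate, reconnect.
    if site_id == "site-3":
        raise RuntimeError("integrity check failed")


slow = estimate_hours(400, 4, 5)    # 80 waves of 4 hours
fast = estimate_hours(400, 4, 50)   # 8 waves of 4 hours

results = restore_all([f"site-{n}" for n in range(5)], fake_restore, max_workers=3)
failed = [site for site, error in results.items() if error]
```

Under these assumed figures the same work drops from 320 hours to 32. The point is not the exact numbers but that the multiplier is the degree of automated parallelism, which has to exist before the incident.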

There is also a prioritization aspect worth noting: Atlassian prioritized enterprise customers and those with the largest user counts. This is rationally defensible from an aggregate impact standpoint, but for the smaller organizations left at the end of the queue, the experience felt like abandonment. Communication around ETAs was consistently criticized as vague — "we are working on it" without concrete dates is insufficient when an entire company is at a standstill.

Remediation and What Atlassian Committed To

In the official post-incident review, Atlassian was notably specific about corrective actions — which is rare and deserves acknowledgment. The main remediation categories announced were:

Prevention of incorrect execution:

Implementation of mandatory dry-run for maintenance scripts that affect production data
ID namespace validation before execution: the system must verify that provided IDs correspond to the expected entity type (app instance vs. customer site)
Mandatory peer review and separate approval for deprovisioning scripts in production
Blast radius limitation by design: maintenance scripts must have explicit limits on the number of entities they can affect in a single execution
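The ID namespace validation item in this list can be sketched as a lookup that refuses any ID whose entity kind does not match what the caller expects. `EntityKind`, `Registry`, and `validate_ids` are hypothetical names chosen for the example. The real provisioning system's data model is not described in the source.

```python
from dataclasses import dataclass, field
from enum import Enum


class EntityKind(Enum):
    APP_INSTANCE = "app_instance"
    SITE = "site"


class WrongEntityKind(Exception):
    """Raised when an ID does not refer to the kind of entity the caller expects."""


@dataclass
class Registry:
    # Hypothetical lookup table: ID -> kind of entity it identifies.
    kinds: dict[str, EntityKind] = field(default_factory=dict)

    def validate_ids(self, ids: list[str], expected: EntityKind) -> list[str]:
        """Return ids unchanged if every one is of the expected kind, else raise."""
        # Unknown IDs resolve to None and are rejected as well.
        bad = [i for i in ids if self.kinds.get(i) is not expected]
        if bad:
            raise WrongEntityKind(f"{len(bad)} ID(s) are not {expected.value}: {bad}")
        return ids


registry = Registry({
    "app-101": EntityKind.APP_INSTANCE,
    "site-7": EntityKind.SITE,
})

# A list that mixes in a site ID is rejected before anything is deleted.
try:
    registry.validate_ids(["app-101", "site-7"], EntityKind.APP_INSTANCE)
    rejected = False
except WrongEntityKind:
    rejected = True
```

The check fails the whole batch rather than skipping the bad entries, on the assumption that one wrong ID means the input list itself cannot be trusted.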

Improvement of the recovery path:

Development of mass-restore tooling to enable parallel recovery of multiple sites
Implementation of soft-delete with a retention period before permanent data destruction
Improved communication processes with affected customers, including more granular ETAs

Observability and detection:

Proactive alerts for bulk delete operations exceeding defined thresholds
Improved monitoring of maintenance operations for faster detection of anomalous executions
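A threshold alert on bulk deletes can be as small as a sliding-window counter. The sketch below is illustrative only: `BulkDeleteMonitor` is a hypothetical class, and the threshold of 5 deletes per 60 seconds is an arbitrary value for the example.

```python
from collections import deque


class BulkDeleteMonitor:
    """Flags when more than `threshold` deletes occur within `window_seconds`."""

    def __init__(self, threshold: int, window_seconds: float) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events: deque[float] = deque()

    def record(self, timestamp: float) -> bool:
        """Record one delete; return True if the alert threshold is now exceeded."""
        self._events.append(timestamp)
        # Drop events that have fallen out of the window.
        while self._events and self._events[0] <= timestamp - self.window_seconds:
            self._events.popleft()
        return len(self._events) > self.threshold


monitor = BulkDeleteMonitor(threshold=5, window_seconds=60)
alerts = [monitor.record(t) for t in range(8)]  # eight deletes, one per second
```

In this run the sixth delete is the first to trip the alert. In a real system the return value would page someone or pause the job. Here it is just a boolean.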

What strikes me about this list is that none of the items are technically complex. Dry-run, soft-delete, ID type validation — these are well-established engineering practices. The problem was not a lack of technical knowledge; it was the absence of a process that made these practices mandatory for high-risk operations. That is an engineering culture and process problem, not a technical capability one.

Structural Lessons

Soft-delete is not optional in multi-tenant SaaS: Any delete operation in a system managing data for multiple customers must pass through a reversible intermediate state. Direct hard deletes in production are technical debt with compound interest.

Mandatory dry-run for destructive operations: Scripts that modify or delete production data must have a simulation phase that lists the exact scope of the operation before any real execution. This must be enforced by the system, not merely recommended as best practice.

Entity type validation is a security layer: When a script receives IDs as input, the system must validate that those IDs correspond to the expected entity type. Namespace confusion is a classic and preventable error vector.

Backups are necessary but insufficient without restore tooling: Having backups is the minimum. The ability to restore hundreds of entities in parallel, with automation and validation, is what determines the actual RTO in a scaled incident.

Blast radius limits by design: Maintenance scripts must have explicit, hardcoded limits on the number of entities they can affect in a single execution. A script that can delete 400 sites in one run is a script that should not exist in that form.

Incident communication requires concrete ETAs: 'We are working on it' without dates is insufficient for customers with halted operations. Communication must include honest estimates, even if conservative, and frequent updates.

My Senior Take

Senior Solutions Architect

I have worked with financial-grade systems where accidental data deletion carries regulatory implications beyond operational damage, so this case resonates in a particular way. What bothers me is not the mistake itself, but the absence of redundancy in safety controls. In any system managing third-party data at scale, I treat delete operations as high-risk by definition, regardless of how routine the process appears. My standard approach for destructive maintenance scripts:

1. Dry-run as first class: The script has no real execution mode without going through dry-run first. It is not an optional flag; it is the default flow. Dry-run output goes to human review before any real execution approval.
2. Soft-delete with retention window: Never direct hard delete in production for customer resources. The resource goes to PENDING_DELETION for at least 24h. Any alert during that window automatically cancels the deletion.
3. Hardcoded blast radius limit: The script has a --max-entities parameter with a conservative default (e.g., 10). Executing at larger scale requires an explicit override with a logged justification. This forces awareness of scope.
4. Type validation before any operation: IDs pass through a validation layer that confirms entity type before any operation. This is trivial to implement and eliminates an entire class of namespace errors.
5. Regularly tested restore runbook: The restore process is tested in a staging environment monthly. If you have not tested the restore, you do not have a backup; you have stored data whose recoverability is a hypothesis.

What Atlassian experienced here is, at its core, a problem of absent safe operations culture for maintenance scripts, an area that frequently falls outside the scope of the more rigorous engineering processes applied to product code. Maintenance scripts are production code. They must be treated as such.
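The --max-entities guard from point 3 might look like the following. This is a sketch under assumptions: the `--justification` flag name, the `BlastRadiusExceeded` exception, and the use of the standard `logging` module for the audit trail are all choices made for the example.

```python
import argparse
import logging

DEFAULT_MAX_ENTITIES = 10  # conservative default, matching the example value in the text


class BlastRadiusExceeded(Exception):
    """Raised when a run would touch more entities than it is allowed to."""


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Hypothetical deprovisioning script")
    parser.add_argument("ids", nargs="+", help="IDs of the entities to deprovision")
    parser.add_argument("--max-entities", type=int, default=DEFAULT_MAX_ENTITIES)
    parser.add_argument("--justification", default="",
                        help="required when --max-entities is raised above the default")
    return parser.parse_args(argv)


def check_blast_radius(args: argparse.Namespace) -> None:
    if args.max_entities > DEFAULT_MAX_ENTITIES:
        if not args.justification.strip():
            raise BlastRadiusExceeded("raising --max-entities requires --justification")
        # The override is allowed, but it leaves a trace for later review.
        logging.warning("limit raised to %d: %s", args.max_entities, args.justification)
    if len(args.ids) > args.max_entities:
        raise BlastRadiusExceeded(
            f"refusing to touch {len(args.ids)} entities; limit is {args.max_entities}"
        )


routine = parse_args(["app-1", "app-2", "app-3"])
check_blast_radius(routine)  # within the default limit, so this passes silently
```

A run with 400 IDs and no flags is refused outright. The same run with `--max-entities 400 --justification "..."` proceeds, but only after someone has typed the scope and a reason.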

Verdict

The Atlassian incident of April 2022 was not caused by a failing technology, an inadequate architecture, or an incompetent engineer. It was caused by the systematic absence of operational controls that are known, documented, and implementable, and that were simply not applied to maintenance scripts with the same rigor applied to product code.

The most important lesson is not in the list of technical remediations, but in the question every platform team should ask regularly: what are the operations that, if executed incorrectly, cause irreversible or slow-to-recover damage, and do those operations have controls proportional to their risk? For maintenance scripts that touch customer data, the correct answer is: mandatory dry-run, soft-delete, entity type validation, blast radius limits, multi-party approval, and tested restore tooling. None of these items are technically sophisticated. All of them are culturally difficult to maintain when operational pressure is high and the script seems routine.

Atlassian had the integrity to publish a detailed and honest post-mortem. That is rare and valuable. The cost was paid by the ~400 customers who lost access for up to two weeks. The learning is free for all of us.

References

Atlassian Engineering — Post-Incident Review: April 2022 Outage



Written with AI assistance from the public case and my architect's reading.
