Atlassian 2022: when a maintenance script takes ~400 customer sites offline for up to two weeks
In April 2022, a poorly parameterized maintenance script executed hard deletes on ~400 Atlassian Cloud customer sites, including Jira, Confluence, and other products. The absence of soft-delete, lack of mandatory dry-run, and nonexistent mass-restore tooling turned a minutes-long operational mistake into up to 14 days of downtime. This post-mortem examines the failure chain, the real blast radius, and the structural lessons every platform team should internalize.
Incident Facts
- Company: Atlassian
- Start date: April 5, 2022
- Total recovery duration: Up to 14 days for the most affected customers
- Impacted customers: ~400 sites (approx. 0.18% of the cloud customer base)
- Affected products: Jira Software, Jira Service Management, Jira Work Management, Confluence, Statuspage, Atlassian Access
- Failure type: Accidental hard delete via maintenance script with incorrect IDs
- Primary root cause: Script received a list of IDs for a legacy app, but those IDs were applied as active production site IDs
- Relevant stack: Atlassian Cloud (multi-tenant SaaS), internal provisioning and deprovisioning systems
- Official source: Post-Incident Review published by Atlassian Engineering
Operational mistakes happen. What separates a contained incident from a two-week catastrophe is not the absence of the mistake but the absence of the defense layers that should have existed before, during, and after it. The Atlassian incident of April 2022 is a textbook case of what not to do when operating a multi-tenant SaaS platform at scale: a script without mandatory dry-run, hard deletes without soft-delete as a safety net, and the late discovery that restoring 400 sites individually, without adequate tooling, would take weeks. This post-mortem is not about blame; it is about architecture.
What happened
Atlassian operates a continuous data cleanup process to remove instances of legacy applications that have been discontinued or disconnected by customers. On April 5, 2022, the responsible team executed a maintenance script with the goal of deprovisioning a specific set of instances of a legacy app — Atlassian Insight (later integrated into Jira Service Management). The script was part of a broader product consolidation process.
The core problem was one of parameterization: the script received as input a list of legacy app instance IDs to be removed, but the execution mechanism interpreted those IDs in the wrong context — applying them as identifiers of active customer sites on the Atlassian Cloud platform. The result was the execution of hard deletes on approximately 400 production sites, wiping data across multiple products per site: Jira Software, Jira Service Management, Confluence, Atlassian Access, and others.
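The namespace confusion described above can be illustrated with a minimal sketch. It assumes a hypothetical registry that maps each raw ID to the kind of entity it identifies. The names `EntityKind`, `validate_ids`, and the registry itself are illustrative and not taken from Atlassian's systems.

```python
from enum import Enum
from typing import Dict, List


class EntityKind(Enum):
    APP_INSTANCE = "app_instance"
    SITE = "site"


class WrongEntityKind(ValueError):
    """Raised when an ID does not belong to the namespace the caller expects."""


def validate_ids(
    raw_ids: List[str],
    expected: EntityKind,
    registry: Dict[str, EntityKind],  # hypothetical lookup: raw ID -> entity kind
) -> List[str]:
    """Return raw_ids unchanged only if every ID resolves to the expected kind."""
    rejected = {}
    for raw in raw_ids:
        kind = registry.get(raw)  # None means the ID is unknown to the registry
        if kind is not expected:
            rejected[raw] = kind.value if kind else "unknown"
    if rejected:
        # Fail the whole batch: a partially valid list is still a suspicious list.
        raise WrongEntityKind(
            f"{len(rejected)} ID(s) are not {expected.value}: {rejected}"
        )
    return raw_ids
```

Rejecting the entire batch on the first mismatch is a deliberate choice in this sketch: a list that mixes namespaces suggests the input was built incorrectly, so nothing in it should be trusted.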
The term "hard delete" here is precise and critical: this was not a logical marking for future removal, nor a soft-delete reversible by administrative operation. The data was permanently deleted from primary storage systems. There was no immediate rollback mechanism. Recovery would depend entirely on backups — and restoring backups, per site, manually, for hundreds of customers, proved to be the true bottleneck of the incident.
Timeline
1. Apr 5: Script executed. The maintenance team executes the deprovisioning script for the legacy Insight app. The script receives incorrect IDs and executes hard deletes on ~400 active customer sites. Execution occurs without a prior dry-run in production.
2. Apr 5: Initial detection. Customers begin reporting total unavailability of their sites. Internal alerts fire. The engineering team identifies that entire sites have been removed, not just specific app instances.
3. Apr 5–6: Damage scope assessed. Atlassian confirms ~400 affected sites. It is discovered that no automated mass-restore process exists. Each site must be restored individually from backups, requiring intensive manual engineering work.
4. Apr 6–10: Public communication and triage. Atlassian publishes updates on the status page and begins directly contacting affected customers. The company prioritizes customers with the highest impact (larger user counts, enterprise contracts). Public criticism grows due to the lack of a clear ETA.
5. Apr 10–18: Phased restoration. Engineering assembles a dedicated task force. The restore process is partially automated over the following days, but still requires manual validation per site. Most customers are restored between April 10 and 18.
6. Apr 18: Last customer restored. Approximately 14 days after the incident began, the last affected site is restored. Atlassian confirms no data was permanently lost: all data was recovered from backups, but downtime ranged from hours to two weeks.
7. Apr 2022: Post-Incident Review published. Atlassian publishes its official post-incident review, detailing causes, timeline, and planned corrective actions. The document is notably transparent by industry standards.
Failure Flow: How the Script Deleted Production Sites
The diagram reconstructs the execution flow of the maintenance script and how the ID confusion propagated hard deletes to the storage systems of multiple products. Dashed arrows indicate the path that should have existed (dry-run, soft-delete) but did not.
- Maintenance Engineer
- Deprovisioning Script
- ID List (legacy Insight)
- Provisioning API
- Site Registry (production IDs)
- Deletion Engine
- Jira Software + JSM Data
- Confluence Data
- Atlassian Access + others
- Backups (manual restore)
- Dry-Run (did not exist)
- Soft-Delete (not implemented)
- Mass Restore (did not exist)
Root Cause: Three Absences, Not One Mistake
The immediate root cause was ID namespace confusion: the script treated instance IDs from a legacy app as production site IDs, and the deletion engine did not validate the context. But the true systemic root cause is the absence of three independent defense layers:
1. No mandatory dry-run: The script was executed directly in production without a simulation phase that would list what would be deleted without actually deleting it. A dry-run would have exposed the error before any damage.
2. No soft-delete: The deprovisioning system executed hard deletes directly. A soft-delete pattern — marking the resource as PENDING_DELETION with a 24–72h retention window — would have allowed immediate reversal after error detection.
3. No mass-restore tooling: Atlassian had backups. The problem was not the absence of recoverable data; it was the absence of infrastructure to restore hundreds of sites in parallel. The manual process was the time multiplier that turned an hours-long incident into a weeks-long one.
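The first missing layer, a mandatory dry-run, can be sketched as a two-phase flow in which the destructive step only accepts a plan that a human has already reviewed. This is a minimal illustration under assumed names (`DeletionPlan`, `build_plan`, `approve`, `apply_plan`), not a description of Atlassian's tooling.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, Optional, Tuple


@dataclass(frozen=True)
class DeletionPlan:
    targets: Tuple[str, ...]        # IDs that would be deleted
    descriptions: Tuple[str, ...]   # human-readable lines for the reviewer
    approved_by: Optional[str] = None


def build_plan(ids: Iterable[str], describe: Callable[[str], str]) -> DeletionPlan:
    """Dry-run phase: resolve and describe every target, delete nothing."""
    targets = tuple(ids)
    return DeletionPlan(targets, tuple(describe(i) for i in targets))


def approve(plan: DeletionPlan, reviewer: str) -> DeletionPlan:
    """Record who reviewed the dry-run output; returns a new, approved plan."""
    return replace(plan, approved_by=reviewer)


def apply_plan(plan: DeletionPlan, delete: Callable[[str], None]) -> int:
    """Destructive phase: refuses to run on a plan nobody has reviewed."""
    if plan.approved_by is None:
        raise PermissionError("plan has not been reviewed; refusing to delete")
    for target in plan.targets:
        delete(target)
    return len(plan.targets)
```

Because `apply_plan` takes a plan rather than a raw ID list, there is no code path that deletes without first producing the reviewable output. In this incident, a description line reading "site" where "app instance" was expected is the kind of discrepancy a reviewer could have caught.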
The Real Blast Radius and Why It Took Two Weeks
It is tempting to reduce this incident to "0.18% of customers" and move on. But blast radius must be measured in terms of impact per affected customer, not just as a proportion of the total base.
Each of the ~400 sites was an organization with multiple users, active projects, ticket history, Confluence documentation, approval workflows, and integrations. For those organizations, the outage was total — not degraded, not partial. Jira inaccessible means engineering teams cannot track bugs, support teams cannot open tickets, releases are blocked. Two weeks in that state for a mid-sized company is measurable operational damage.
What prolonged recovery was the absence of automation in the restore path. Atlassian maintained regular backups — this was critical in ensuring that no data was permanently lost. But the restore process was essentially: identify the correct backup for that specific site, provision a restore environment, execute the restore, validate data integrity, reconnect integrations, and notify the customer. Multiplied by 400, without automated parallelism, this becomes a weeks-long operation even with a dedicated task force.
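To make the effect of parallelism concrete, here is a rough sketch with two parts: a back-of-the-envelope wall-clock estimate and a bounded parallel runner. The per-site duration and worker count in the comments are assumed figures chosen for illustration, not numbers from the post-incident review, and `restore_site` is a hypothetical callable standing in for the whole per-site procedure.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def estimated_hours(sites: int, hours_per_site: float, workers: int) -> float:
    """Wall-clock estimate when sites are restored in batches of `workers`."""
    return math.ceil(sites / workers) * hours_per_site


# Assumed figures: 400 sites at 2 hours each.
#   1 worker   -> 800 hours (over a month of continuous work)
#   20 workers -> 40 hours
def restore_all(
    site_ids: List[str],
    restore_site: Callable[[str], None],  # hypothetical per-site restore procedure
    workers: int = 8,
) -> Dict[str, str]:
    """Restore sites with bounded parallelism and record an outcome per site."""

    def attempt(site_id: str) -> str:
        try:
            restore_site(site_id)
            return "restored"
        except Exception as exc:  # one failed site must not stop the others
            return f"failed: {exc}"

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, so zip pairs them correctly.
        return dict(zip(site_ids, pool.map(attempt, site_ids)))
```

Capturing failures per site instead of aborting the run matters here: with hundreds of restores in flight, a few will fail validation, and the operator needs a list to retry rather than a stack trace.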
There is also a prioritization aspect worth noting: Atlassian prioritized enterprise customers and those with the largest user counts. This is rationally defensible from an aggregate impact standpoint, but for the smaller organizations left at the end of the queue, the experience felt like abandonment. Communication around ETAs was consistently criticized as vague — "we are working on it" without concrete dates is insufficient when an entire company is at a standstill.
Remediation and What Atlassian Committed To
In the official post-incident review, Atlassian was notably specific about corrective actions — which is rare and deserves acknowledgment. The main remediation categories announced were:
Prevention of incorrect execution:
- Implementation of mandatory dry-run for maintenance scripts that affect production data
- ID namespace validation before execution: the system must verify that provided IDs correspond to the expected entity type (app instance vs. customer site)
- Mandatory peer review and separate approval for deprovisioning scripts in production
- Blast radius limitation by design: maintenance scripts must have explicit limits on the number of entities they can affect in a single execution
Improvement of the recovery path:
- Development of mass-restore tooling to enable parallel recovery of multiple sites
- Implementation of soft-delete with a retention period before permanent data destruction
- Improved communication processes with affected customers, including more granular ETAs
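The soft-delete commitment can be pictured as a small state machine in which permanent destruction is a separate, time-gated step. The sketch below assumes a 72-hour retention window and uses illustrative names (`Site`, `request_deletion`, `restore`, `purge_if_due`). The actual retention period Atlassian chose is not stated here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional

RETENTION = timedelta(hours=72)  # assumed window, for illustration only


class SiteState(Enum):
    ACTIVE = "active"
    PENDING_DELETION = "pending_deletion"
    PURGED = "purged"


@dataclass
class Site:
    site_id: str
    state: SiteState = SiteState.ACTIVE
    purge_after: Optional[datetime] = None


def request_deletion(site: Site, now: datetime) -> None:
    """Mark the site for deletion; data stays intact until the window expires."""
    site.state = SiteState.PENDING_DELETION
    site.purge_after = now + RETENTION


def restore(site: Site) -> None:
    """Undo a pending deletion. Only possible before the purge has run."""
    if site.state is not SiteState.PENDING_DELETION:
        raise ValueError(f"cannot restore a site in state {site.state.value}")
    site.state = SiteState.ACTIVE
    site.purge_after = None


def purge_if_due(site: Site, now: datetime) -> bool:
    """Permanently delete only when the retention window has fully elapsed."""
    if site.state is SiteState.PENDING_DELETION and now >= site.purge_after:
        site.state = SiteState.PURGED
        return True
    return False
```

With this shape, reversing a mistaken deletion inside the window is a metadata update rather than a backup restore.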
Observability and detection:
- Proactive alerts for bulk delete operations exceeding defined thresholds
- Improved monitoring of maintenance operations for faster detection of anomalous executions
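A threshold alert for bulk deletes can be as simple as a sliding-window counter. The following is a minimal sketch. The class name, the threshold, and the window size are assumptions, and a real deployment would more likely express this as a rule in an existing metrics system.

```python
from collections import deque
from typing import Deque


class DeleteRateMonitor:
    """Flags when more than `max_deletes` deletions occur within the window."""

    def __init__(self, max_deletes: int, window_seconds: float) -> None:
        self.max_deletes = max_deletes
        self.window_seconds = window_seconds
        self._events: Deque[float] = deque()

    def record(self, timestamp: float) -> bool:
        """Record one delete; return True if the threshold is now exceeded."""
        self._events.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop events that have aged out of the window.
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events) > self.max_deletes
```

If the deletion engine calls `record` before each delete and halts when it returns True, the same counter serves as a circuit breaker as well as an alert source.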
What strikes me about this list is that none of the items are technically complex. Dry-run, soft-delete, ID type validation — these are well-established engineering practices. The problem was not a lack of technical knowledge; it was the absence of a process that made these practices mandatory for high-risk operations. That is an engineering culture and process problem, not a technical capability one.
Structural Lessons
I have worked with financial-grade systems where accidental data deletion carries regulatory implications beyond operational damage — so this case resonates in a particular way. What bothers me is not the mistake itself, but the absence of redundancy in safety controls.
In any system managing third-party data at scale, I treat delete operations as high-risk by definition — regardless of how routine the process appears. My standard approach for destructive maintenance scripts:
1. Dry-run as first class: The script has no real execution mode without going through dry-run first. It is not an optional flag — it is the default flow. Dry-run output goes to human review before any real execution approval.
2. Soft-delete with retention window: Never direct hard delete in production for customer resources. The resource goes to PENDING_DELETION for at least 24h. Any alert during that window automatically cancels the deletion.
3. Hardcoded blast radius limit: The script has a --max-entities parameter with a conservative default (e.g., 10). Executing at larger scale requires an explicit override with a logged justification. This forces awareness of scope.
4. Type validation before any operation: IDs pass through a validation layer that confirms entity type before any operation. This is trivial to implement and eliminates an entire class of namespace errors.
5. Regularly tested restore runbook: The restore process is tested in a staging environment monthly. If you have not tested the restore, you do not have a backup — you have stored data whose recoverability is a hypothesis.
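The blast radius limit from item 3 might look like the following command-line sketch. The `--max-entities` flag and its default of 10 come from the description above. The `--justification` flag and the function names are assumptions added for illustration.

```python
import argparse
import logging
from typing import List

DEFAULT_MAX_ENTITIES = 10  # conservative default, per the approach described above


def parse_args(argv: List[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Deprovision entities (sketch).")
    parser.add_argument("ids", nargs="+", help="entity IDs to deprovision")
    parser.add_argument("--max-entities", type=int, default=DEFAULT_MAX_ENTITIES)
    # Hypothetical flag: free-text reason, required when raising the limit.
    parser.add_argument("--justification", default="")
    return parser.parse_args(argv)


def enforce_limit(args: argparse.Namespace) -> None:
    """Exit before any work is done if the requested scope is not permitted."""
    raised = args.max_entities > DEFAULT_MAX_ENTITIES
    if raised and not args.justification.strip():
        raise SystemExit("raising --max-entities requires --justification")
    if len(args.ids) > args.max_entities:
        raise SystemExit(
            f"{len(args.ids)} entities exceed the limit of {args.max_entities}"
        )
    if raised:
        # The override is logged so the decision leaves an audit trail.
        logging.warning(
            "limit raised to %d: %s", args.max_entities, args.justification
        )
```

A run against 400 IDs with default settings stops immediately. To proceed, the operator has to type both a larger limit and a reason, which is the moment of forced scope awareness the approach is meant to create.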
What Atlassian experienced here is, at its core, a problem of absent safe operations culture for maintenance scripts — an area that frequently falls outside the scope of the more rigorous engineering processes applied to product code. Maintenance scripts are production code. They must be treated as such.
Verdict
The Atlassian incident of April 2022 was not caused by a failing technology, an inadequate architecture, or an incompetent engineer. It was caused by the systematic absence of operational controls that are known, documented, and implementable — and that were simply not applied to maintenance scripts with the same rigor applied to product code. The most important lesson is not in the list of technical remediations, but in the question every platform team should ask regularly: what are the operations that, if executed incorrectly, cause irreversible or slow-to-recover damage — and do those operations have controls proportional to their risk? For maintenance scripts that touch customer data, the correct answer is: mandatory dry-run, soft-delete, entity type validation, blast radius limits, multi-party approval, and tested restore tooling. None of these items are technically sophisticated. All of them are culturally difficult to maintain when operational pressure is high and the script seems routine. Atlassian had the integrity to publish a detailed and honest post-mortem. That is rare and valuable. The cost was paid by the ~400 customers who lost access for up to two weeks. The learning is free for all of us.