Google Cloud × UniSuper (2024): When a Misconfiguration Deleted an Entire Subscription
In May 2024, a misconfiguration during the provisioning of a private VMware environment on Google Cloud resulted in the complete deletion of UniSuper's subscription across two regions. Recovery was only possible because UniSuper maintained backups with a separate cloud provider — a decision that saved data for 500,000 members. The incident exposes systemic failures in deletion safeguards, automation blast radius, and the fallacy of assuming resilience within a single provider.
Incident Fact Sheet
- Affected organization: UniSuper, an Australian superannuation fund
- Cloud provider: Google Cloud (GCVE, Google Cloud VMware Engine)
- Incident date: started May 2024 (exact date not publicly disclosed)
- Outage duration: approximately 2 weeks for full service restoration
- Member scale: ~500,000 members whose access to services was affected
- Affected regions: two Google Cloud regions (both UniSuper GCVE instances)
- Root cause: misconfiguration during GCVE provisioning that triggered automatic subscription deletion after a trial period
- Recovery factor: backups maintained with a separate (non-Google) cloud provider
- Technical stack: GCVE (VMware Engine), Google Cloud Storage, backups on an alternate cloud
- Public statement: joint statement by UniSuper and Google Cloud, unusually transparent for the industry
A single incorrectly configured parameter during the provisioning of a private VMware cloud deleted UniSuper's entire Google Cloud subscription — across two regions simultaneously. It was not an attack. It was not a hardware failure. It was automation executing exactly what it was instructed to do, with no safeguard to question whether it should. What saved 500,000 members from an irreversible disaster was not the provider's resilience — it was a backup at a different provider.
What happened
UniSuper is one of Australia's largest superannuation funds, managing assets for approximately 500,000 members — predominantly employees in the higher education sector. As part of its infrastructure modernization strategy, UniSuper migrated workloads to Google Cloud using Google Cloud VMware Engine (GCVE), a managed service that allows running VMware workloads natively on Google's infrastructure. GCVE is a common choice for organizations that need to maintain compatibility with legacy VMware environments while adopting the cloud.
During the GCVE environment provisioning process — executed on the Google Cloud side — an operator made a critical error: they inadvertently configured the subscription with an expiration period, as if it were a trial or temporary environment. This type of configuration, when activated, instructs the platform to automatically destroy all resources associated with the subscription at the end of the defined period. UniSuper was not informed of this configuration. There was no visible alert. No additional confirmation was required.
When the configured deadline expired, Google Cloud's automation executed the deletion process with full fidelity: all virtual machines, all data volumes, all snapshots, all internal backups — across both regions where UniSuper operated. The deletion was complete and, within the Google Cloud ecosystem, irreversible. There was no recycle bin. There was no post-deletion retention period covering the full scope of the event. The subscription simply ceased to exist.
What prevented this from becoming a catastrophic and permanent data loss was an architectural decision that, at the time it was made, might have seemed redundant to some: UniSuper maintained copies of its critical data with a completely separate cloud provider, outside the Google ecosystem. This provider separation — not just region or account separation — was the only mechanism that survived the event.
Incident Timeline
1. Months prior: GCVE provisioning
   Google Cloud provisions the GCVE environment for UniSuper. During this process, a Google operator inadvertently configures the subscription with an automatic expiration parameter, equivalent to a trial period, instead of a permanent production subscription. This parameter is not clearly visible or communicated to UniSuper.
2. Normal operations period
   UniSuper operates normally on the GCVE infrastructure. Critical workloads run across both configured regions. Backups are executed regularly, including copies maintained with a separate cloud provider, outside Google. No alert or indication of the expiration parameter is observed.
3. May 2024: Automatic expiration executed
   The configured expiration deadline is reached. Google Cloud's automation initiates the subscription deletion process. All associated resources (VMs, volumes, snapshots, internal backups) are destroyed across both regions simultaneously. The deletion is complete within the Google Cloud ecosystem.
4. Detection and incident declaration
   UniSuper detects total service unavailability. Google Cloud confirms the subscription was deleted and that data is not recoverable within the Google ecosystem. Teams from both organizations begin collaborating to assess the scope and plan recovery.
5. Recovery begins: External backups activated
   Recovery begins from backups maintained with the alternative cloud provider. This is the only available recovery path. The process is complex: it requires complete reconfiguration of the GCVE environment, restoration of significant data volumes, and integrity revalidation of critical systems for a superannuation fund.
6. ~2 weeks after incident: Full restoration
   UniSuper's services are fully restored. UniSuper and Google Cloud issue a joint statement, unusual in the industry for its transparency, acknowledging the error on Google's side, describing what happened, and confirming that no member data was permanently lost thanks to external backups.
Failure Flow: Cascading Deletion and Recovery Path
The diagram reconstructs UniSuper's operational topology on Google Cloud and the failure path. The deletion propagated from a single configuration parameter to all resources across both regions. The only recovery vector was external to the provider.
- Google Cloud control plane
  - GCVE provisioning (⚠️ expiry parameter set)
  - Subscription auto-delete job
- GCVE cluster, Region A
  - Virtual machines (workloads)
  - Persistent volumes and snapshots
  - Internal backups (Google-side)
- GCVE cluster, Region B
  - Virtual machines (workloads)
  - Persistent volumes and snapshots
  - Internal backups (Google-side)
- Offsite backups (✅ survived deletion)
- UniSuper operations
  - ~500k members (service disrupted)
Root Cause: Automation Without Blast Radius Limits and Absence of Deletion Safeguards
The root cause was not merely the human error of configuring the wrong parameter; human errors are inevitable. The systemic root cause was the absence of multiple protection layers that should have prevented or limited the impact of such an error:
1. No explicit confirmation for destructive operations at production scale: a parameter that instructs deletion of an entire subscription should require multiple confirmations, separate approval, and customer notification before any execution.
2. No clear differentiation between trial and production environments in the provisioning flow: the same expiration mechanism used for trials was applicable to an active production subscription of a regulated pension fund.
3. No sufficient post-deletion retention period: internal backups were destroyed along with primary data. They were co-dependent on the subscription, not independent of it.
4. Total blast radius: the deletion affected both regions simultaneously because both were tied to the same subscription. Geographic redundancy within the same provider did not protect against an operation executed at the subscription level.
The Recovery and What It Revealed
Recovery took approximately two weeks: a long period for any service, and extraordinarily long for a superannuation fund whose members need to access retirement information, make contributions, and make financial decisions. During this period, UniSuper members were without access to their portals and digital services.
The recovery process was technically complex for reasons beyond simple data volume. Restoring a GCVE environment from scratch from external backups is not like restoring files from a disk — it involves reconfiguring the underlying VMware infrastructure, re-establishing network configurations, revalidating data integrity across systems with cross-dependencies, and ensuring the restored state is consistent enough to operate in compliance with the regulatory obligations of an Australian superannuation fund.
What the recovery revealed with brutal clarity is the difference between redundancy and independence. UniSuper had geographic redundancy — two regions. But both regions were under the same control plane, tied to the same subscription. When the subscription was deleted, geographic redundancy became irrelevant. Independence — maintaining data with a completely separate provider, with separate credentials, with a control plane that shares no failure surface with the primary provider — was the only mechanism that mattered.
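The distinction between redundancy and independence can be made concrete with a small model. The following is a minimal sketch, not a description of Google Cloud's internals: the BackupCopy type, the provider and subscription names, and the survivors function are all hypothetical. It only illustrates that a copy is lost whenever it sits under the deleted subscription, whatever region it is in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str
    provider: str      # who operates the control plane (hypothetical label)
    subscription: str  # organizational container that owns the copy
    region: str


def survivors(copies, deleted_provider, deleted_subscription):
    """Return the copies that survive deletion of one subscription.

    Region is deliberately ignored: a subscription-level deletion
    removes every copy under that subscription, in every region.
    """
    return [
        c for c in copies
        if not (c.provider == deleted_provider
                and c.subscription == deleted_subscription)
    ]


# Illustrative topology: two regions under one subscription, one offsite copy.
copies = [
    BackupCopy("primary-a", "provider-1", "sub-prod", "region-a"),
    BackupCopy("replica-b", "provider-1", "sub-prod", "region-b"),
    BackupCopy("offsite", "provider-2", "sub-backup", "region-x"),
]

remaining = survivors(copies, "provider-1", "sub-prod")
print([c.name for c in remaining])  # ['offsite']
```

Running the same check against a real backup inventory is one cheap way to see whether any copy exists outside the primary control plane.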
Also notable is what did not happen: there was no permanent loss of member data. In a scenario where the architectural decision to maintain external backups had not been made, the incident would have been catastrophic and likely irreversible. The joint statement issued by UniSuper's CEO and Google Cloud's CEO is explicit on this point — and the fact that a joint statement was issued is, in itself, significant. Cloud providers rarely publicly admit operational errors with this level of detail.
What Should Exist: Structural Safeguards Against Catastrophic Deletion
This incident is not about an operator who made a mistake. It is about a system that allowed an operator mistake to have unlimited blast radius. Architecting against this type of failure requires controls at multiple layers:
At the provider level (what Google Cloud should have had):
- Structural and mandatory differentiation between trial and production parameters, with explicit validation of subscription type before any expiration configuration.
- Proactive customer notification when any destructive parameter is configured, with a confirmation window and grace period.
- Post-deletion retention period independent of the subscription — data in soft-delete for at least 30 days after any subscription-scale deletion operation.
- Multi-party approval for operations affecting resources across multiple regions simultaneously.
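The provider-level controls above can be sketched as a set of guards around destructive operations. This is a minimal illustration under stated assumptions, not how Google Cloud implements provisioning: the Subscription type, the GuardError exception, the two-approver threshold, and all function names are hypothetical. Only the 30-day retention figure comes from the list above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Set

RETENTION = timedelta(days=30)  # soft-delete window from the list above
REQUIRED_APPROVALS = 2          # assumed multi-party threshold


class GuardError(Exception):
    """Raised when a destructive request fails a safeguard."""


@dataclass
class Subscription:
    name: str
    tier: str                   # "trial" or "production"
    regions: Set[str]
    expires_at: Optional[datetime] = None
    soft_deleted_at: Optional[datetime] = None


def set_expiry(sub, expires_at):
    """Only trial subscriptions may carry an automatic expiry."""
    if sub.tier == "production":
        raise GuardError("production subscriptions cannot have an expiry")
    sub.expires_at = expires_at


def request_delete(sub, approvers, customer_confirmed, now):
    """Mark the subscription soft-deleted; nothing is destroyed yet."""
    if not customer_confirmed:
        raise GuardError("customer has not confirmed the deletion")
    if len(sub.regions) > 1 and len(set(approvers)) < REQUIRED_APPROVALS:
        raise GuardError("multi-region deletion needs more approvers")
    sub.soft_deleted_at = now


def can_purge(sub, now):
    """Data may be destroyed only after the retention window has passed."""
    return (sub.soft_deleted_at is not None
            and now - sub.soft_deleted_at >= RETENTION)


prod = Subscription("example-prod", "production", {"region-a", "region-b"})

try:
    set_expiry(prod, datetime(2025, 1, 1))
except GuardError as err:
    print("blocked:", err)

t0 = datetime(2024, 1, 1)  # illustrative timestamp
request_delete(prod, ["alice", "bob"], customer_confirmed=True, now=t0)
print(can_purge(prod, t0 + timedelta(days=7)))   # False
print(can_purge(prod, t0 + timedelta(days=30)))  # True
```

The point of the design is that each guard fails closed: a missing confirmation or a production tier stops the operation, and even an approved deletion stays reversible for the retention window.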
At the customer level (what any critical organization should have):
- Backups outside the primary provider — not just outside the region, but outside the provider. Separate credentials, separate control plane, no dependency on the primary provider for recovery.
- Periodic full restoration tests from external backups — not just existence verification, but DR exercises that validate recovery time and integrity.
- Regular inventory and audit of subscription configurations, including lifecycle parameters that may have been set during provisioning.
- Clear RTO and RPO definition for the total provider loss scenario — not just for zone or region failures.
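A DR exercise for the total provider loss scenario is only useful if its outcome is compared against explicit targets. The sketch below shows one way to do that; the DrillResult fields, the evaluate function, and every timestamp and target are hypothetical values chosen for illustration, not figures from the incident.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    last_backup_at: datetime  # newest external backup before the loss
    loss_at: datetime         # simulated moment the provider is lost
    restored_at: datetime     # services verified healthy again
    checksums_match: bool     # restored data matched recorded checksums


def evaluate(drill, rto, rpo):
    """Compare what the drill achieved with the RTO and RPO targets."""
    achieved_rpo = drill.loss_at - drill.last_backup_at  # data at risk
    achieved_rto = drill.restored_at - drill.loss_at     # time to recover
    return {
        "rpo_met": achieved_rpo <= rpo,
        "rto_met": achieved_rto <= rto,
        "integrity_ok": drill.checksums_match,
    }


loss = datetime(2024, 1, 10)  # illustrative date
drill = DrillResult(
    last_backup_at=loss - timedelta(hours=6),
    loss_at=loss,
    restored_at=loss + timedelta(days=14),
    checksums_match=True,
)

report = evaluate(drill, rto=timedelta(days=3), rpo=timedelta(hours=24))
print(report)  # {'rpo_met': True, 'rto_met': False, 'integrity_ok': True}
```

In this made-up drill the data loss window is acceptable but the recovery time is not, which is exactly the kind of gap an existence check on backups would never surface.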
At the architectural level:
- Separate the identity and access control plane from data resources. In public clouds, a compromised or deleted subscription or organization can take all resources with it. Consider patterns where critical data has protections that survive deletion of the organizational container.
- Implement deletion locks (such as a project lien with the resourcemanager.projects.delete restriction in GCP, or DeletionPolicy: Retain in AWS CloudFormation) on all critical production resources. These mechanisms exist; they need to be used systematically, not optionally.
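Using deletion locks systematically implies checking for them automatically. As a minimal sketch, the function below scans a parsed AWS CloudFormation template for resources that lack DeletionPolicy: Retain. The set of resource types treated as critical, the function name, and the sample template are illustrative assumptions; the check also ignores other protective values such as Snapshot.

```python
import json

# Which types count as "critical" is an illustrative choice, not a standard.
CRITICAL_TYPES = {
    "AWS::S3::Bucket",
    "AWS::RDS::DBInstance",
    "AWS::DynamoDB::Table",
}


def unprotected_resources(template):
    """Return logical IDs of critical resources without DeletionPolicy: Retain."""
    missing = []
    for logical_id, resource in template.get("Resources", {}).items():
        if resource.get("Type") not in CRITICAL_TYPES:
            continue
        # DeletionPolicy is a resource-level attribute, a sibling of Type.
        if resource.get("DeletionPolicy") != "Retain":
            missing.append(logical_id)
    return sorted(missing)


# Hypothetical template fragment used only to exercise the check.
template = json.loads("""
{
  "Resources": {
    "MemberData": {"Type": "AWS::S3::Bucket", "DeletionPolicy": "Retain"},
    "Ledger": {"Type": "AWS::RDS::DBInstance"},
    "Queue": {"Type": "AWS::SQS::Queue"}
  }
}
""")

print(unprotected_resources(template))  # ['Ledger']
```

A check like this can run in CI so that a missing lock blocks a deployment instead of being discovered after a deletion.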
The irony of this case is that UniSuper did the right thing — it maintained external backups. But that was an individual decision by a specific organization, not a structural guarantee of the ecosystem. For every UniSuper that made that decision, how many organizations assume that geographic redundancy within the provider is sufficient?
Technical Lessons
I have worked with financial systems for over 16 years and this incident bothers me in a specific way: not because of the error itself, but because of the absence of controls that should be obvious in any platform hosting production data for a regulated financial institution.

What concerns me more than the Google operator's error is the implicit trust architecture that most organizations have with their cloud providers. When you place your backups with the same provider as your primary data, even in different regions, even in different accounts, you are assuming the provider will never make an error that broadly affects the control plane. That is an assumption this incident proves to be false.

My position is direct: for any system where data loss is unacceptable, the backup strategy must include at least one copy completely outside the primary provider, with access managed by credentials that do not depend on the primary provider for authentication or authorization. This is not paranoia; it is the reasonable minimum given what we know about control plane blast radius.

There is also a lesson about automation that is frequently underestimated: automation is faithful, not intelligent. It executes what it was configured to execute, without considering context, without questioning intent, without evaluating consequences. The work of limiting automation blast radius belongs to the architect, not to the automation. Deletion locks, multi-party approvals, grace periods, and proactive notifications are the architectural equivalent of 'are you sure?', and they need to be defaults, not options.

Finally: UniSuper made the right decision to maintain external backups. But that decision was made despite the ecosystem, not because of it. What we need is an ecosystem where the right decision is the path of least resistance, where the default configuration of any production subscription includes protections against accidental deletion, not one where the customer needs to discover and configure this manually.
Verdict
The 2024 UniSuper-Google Cloud incident is one of the most important cloud resilience cases of the last decade, not because it was the largest in scale, but because it exposes a foundational assumption failure that is endemic in the industry: the belief that redundancy within a provider is equivalent to resilience against the provider.

UniSuper survived because it made an architectural decision that, at the time, may have seemed overly conservative to some: maintaining backups completely outside of Google Cloud. That decision, and not any Google Cloud capability, was what prevented permanent data loss for 500,000 superannuation fund members. The implications are direct for any critical systems architect.

What this case proves:
- Geographic redundancy within a single provider does not protect against control plane failures.
- Backups co-located with primary data at the same provider are not independent backups.
- Automation without safeguards proportional to impact is a systemic risk, not an operational efficiency.

What this case recommends:
- Backups outside the primary provider as a non-negotiable requirement for critical data.
- Deletion locks as defaults on production resources.
- DR tests for the total provider loss scenario, not just zone failures.
- Periodic audits of subscription lifecycle configurations.

What this case leaves open: How many organizations are operating today with the same exposure UniSuper had, without the external backups that saved it? The honest answer is: probably most of them. And that is the true legacy of this incident.