Google Cloud × UniSuper (2024): When a Misconfiguration Deleted an Entire Subscription

May 2, 2024 · 11 min · AI-assisted

In May 2024, a misconfiguration during the provisioning of a private VMware environment on Google Cloud resulted in the complete deletion of UniSuper's subscription across two regions. Recovery was only possible because UniSuper maintained backups with a separate cloud provider — a decision that saved data for 500,000 members. The incident exposes systemic failures in deletion safeguards, automation blast radius, and the fallacy of assuming resilience within a single provider.

Incident Fact Sheet

Affected organization: UniSuper, an Australian superannuation fund
Cloud provider: Google Cloud (GCVE, Google Cloud VMware Engine)
Incident date: May 2024 (exact start date not publicly disclosed)
Outage duration: Approximately 2 weeks for full service restoration
Member scale: ~500,000 members whose access to services was affected
Affected regions: Two Google Cloud regions (both UniSuper GCVE instances)
Root cause: Misconfiguration during GCVE provisioning that triggered automatic subscription deletion when a trial-style expiration period elapsed
Recovery factor: Backups maintained with a separate (non-Google) cloud provider
Technical stack: GCVE (VMware Engine), Google Cloud Storage, backups on alternate cloud
Public statement: Joint statement by UniSuper and Google Cloud, unusually transparent for the industry

A single incorrectly configured parameter during the provisioning of a private VMware cloud deleted UniSuper's entire Google Cloud subscription — across two regions simultaneously. It was not an attack. It was not a hardware failure. It was automation executing exactly what it was instructed to do, with no safeguard to question whether it should. What saved 500,000 members from an irreversible disaster was not the provider's resilience — it was a backup at a different provider.

What happened

UniSuper is one of Australia's largest superannuation funds, managing assets for approximately 500,000 members — predominantly employees in the higher education sector. As part of its infrastructure modernization strategy, UniSuper migrated workloads to Google Cloud using Google Cloud VMware Engine (GCVE), a managed service that allows running VMware workloads natively on Google's infrastructure. GCVE is a common choice for organizations that need to maintain compatibility with legacy VMware environments while adopting the cloud.

During the GCVE environment provisioning process — executed on the Google Cloud side — an operator made a critical error: they inadvertently configured the subscription with an expiration period, as if it were a trial or temporary environment. This type of configuration, when activated, instructs the platform to automatically destroy all resources associated with the subscription at the end of the defined period. UniSuper was not informed of this configuration. There was no visible alert. No additional confirmation was required.

When the configured deadline expired, Google Cloud's automation executed the deletion process with full fidelity: all virtual machines, all data volumes, all snapshots, all internal backups — across both regions where UniSuper operated. The deletion was complete and, within the Google Cloud ecosystem, irreversible. There was no recycle bin. There was no post-deletion retention period covering the full scope of the event. The subscription simply ceased to exist.

What prevented this from becoming a catastrophic and permanent data loss was an architectural decision that, at the time it was made, might have seemed redundant to some: UniSuper maintained copies of its critical data with a completely separate cloud provider, outside the Google ecosystem. This provider separation — not just region or account separation — was the only mechanism that survived the event.

Incident Timeline

  1. Months prior: GCVE Provisioning

    Google Cloud provisions the GCVE environment for UniSuper. During this process, a Google operator inadvertently configures the subscription with an automatic expiration parameter — equivalent to a trial period — instead of a permanent production subscription. This parameter is not clearly visible or communicated to UniSuper.

  2. Normal operations period

    UniSuper operates normally on the GCVE infrastructure. Critical workloads run across both configured regions. Backups are executed regularly — including copies maintained with a separate cloud provider, outside Google. No alert or indication of the expiration parameter is observed.

  3. May 2024: Automatic expiration executed

    The configured expiration deadline is reached. Google Cloud's automation initiates the subscription deletion process. All associated resources — VMs, volumes, snapshots, internal backups — are destroyed across both regions simultaneously. The deletion is complete within the Google Cloud ecosystem.

  4. Detection and incident declaration

    UniSuper detects total service unavailability. Google Cloud confirms the subscription was deleted and that data is not recoverable within the Google ecosystem. Teams from both organizations begin collaborating to assess the scope and plan recovery.

  5. Recovery begins: External backups activated

    Recovery begins from backups maintained with the alternative cloud provider. This is the only available recovery path. The process is complex: it requires complete reconfiguration of the GCVE environment, restoration of significant data volumes, and integrity revalidation of critical systems for a superannuation fund.

  6. ~2 weeks after incident: Full restoration

    UniSuper's services are fully restored. UniSuper and Google Cloud issue a joint statement — unusual in the industry for its transparency — acknowledging the error on Google's side, describing what happened, and confirming that no member data was permanently lost thanks to external backups.

Failure Flow: Cascading Deletion and Recovery Path

The diagram reconstructs UniSuper's operational topology on Google Cloud and the failure path. The deletion propagated from a single configuration parameter to all resources across both regions. The only recovery vector was external to the provider.

⚙️ Google Cloud: Control Plane
  • Google Cloud · Control Plane
  • GCVE Provisioning · ⚠️ expiry param set
  • Subscription · Auto-Delete Job
🌏 Google Cloud: Region A
  • GCVE Cluster · Region A
  • Virtual Machines · (workloads)
  • Persistent Volumes · + Snapshots
  • Internal Backups · (Google-side)
🌏 Google Cloud: Region B
  • GCVE Cluster · Region B
  • Virtual Machines · (workloads)
  • Persistent Volumes · + Snapshots
  • Internal Backups · (Google-side)
🛡️ External Cloud Provider
  • Offsite Backups · ✅ Survived deletion
👤 UniSuper
  • UniSuper · Operations
  • ~500k Members · (service disrupted)

Root Cause: Automation Without Blast Radius Limits and Absence of Deletion Safeguards

The root cause was not merely the human error of configuring the wrong parameter; human errors are inevitable. The systemic root cause was the absence of multiple protection layers that should have prevented or limited the impact of such an error:

  1. No explicit confirmation for destructive operations at production scale: a parameter that instructs deletion of an entire subscription should require multiple confirmations, separate approval, and customer notification before any execution.
  2. No clear differentiation between trial and production environments in the provisioning flow: the same expiration mechanism used for trials was applicable to an active production subscription of a regulated pension fund.
  3. No sufficient post-deletion retention period: internal backups were destroyed along with primary data because they were co-dependent on the subscription, not independent of it.
  4. Total blast radius: the deletion affected both regions simultaneously because both were tied to the same subscription. Geographic redundancy within the same provider did not protect against an operation executed at the subscription level.

The Recovery and What It Revealed

Recovery took approximately two weeks: a long period for any service, but extraordinarily long for a superannuation fund whose members need to access retirement information, make contributions, and make financial decisions. During this period, UniSuper members were without access to their portals and digital services.

The recovery process was technically complex for reasons beyond simple data volume. Restoring a GCVE environment from scratch from external backups is not like restoring files from a disk — it involves reconfiguring the underlying VMware infrastructure, re-establishing network configurations, revalidating data integrity across systems with cross-dependencies, and ensuring the restored state is consistent enough to operate in compliance with the regulatory obligations of an Australian superannuation fund.
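
The integrity revalidation step described above can be illustrated with a checksum manifest. This is a minimal sketch, assuming a hypothetical manifest that maps relative file paths to SHA-256 digests recorded at backup time; the function names and the manifest layout are illustrative and are not taken from UniSuper's or Google's tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large restores do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(restore_root: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare restored files against digests recorded at backup time.

    `manifest` is a hypothetical mapping of relative path -> expected SHA-256.
    """
    report: dict[str, list[str]] = {"ok": [], "missing": [], "corrupt": []}
    for relative_path, expected in sorted(manifest.items()):
        candidate = restore_root / relative_path
        if not candidate.is_file():
            report["missing"].append(relative_path)
        elif sha256_of(candidate) != expected:
            report["corrupt"].append(relative_path)
        else:
            report["ok"].append(relative_path)
    return report
```

A restore would only be declared complete when both the "missing" and "corrupt" lists are empty. File-level checksums are just one layer; they say nothing about cross-system consistency, which still needs application-level checks.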

What the recovery revealed with brutal clarity is the difference between redundancy and independence. UniSuper had geographic redundancy — two regions. But both regions were under the same control plane, tied to the same subscription. When the subscription was deleted, geographic redundancy became irrelevant. Independence — maintaining data with a completely separate provider, with separate credentials, with a control plane that shares no failure surface with the primary provider — was the only mechanism that mattered.
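
The difference between redundancy and independence can be made concrete with a small model. The sketch below is illustrative only: the `DataCopy` type, the scope names, and the identifiers such as "unisuper-gcve" are assumptions made for the example, not real resource names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    """Where one copy of the data lives. All values are illustrative."""
    provider: str
    subscription: str
    region: str


def survivors(copies: list[DataCopy], scope: str, target: str) -> list[DataCopy]:
    """Return the copies that survive losing `target` at the given `scope`.

    `scope` is one of "region", "subscription", or "provider".
    """
    if scope not in ("region", "subscription", "provider"):
        raise ValueError(f"unknown scope: {scope}")
    return [copy for copy in copies if getattr(copy, scope) != target]


copies = [
    DataCopy("google-cloud", "unisuper-gcve", "region-a"),
    DataCopy("google-cloud", "unisuper-gcve", "region-b"),
    DataCopy("external-provider", "offsite-backups", "region-x"),
]

# A regional failure leaves two copies; a subscription-level deletion leaves one.
print(len(survivors(copies, "region", "region-a")))            # 2
print(len(survivors(copies, "subscription", "unisuper-gcve")))  # 1
```

Without the third entry, the subscription-level case returns an empty list, which is the scenario the external backup prevented.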

Also notable is what did not happen: there was no permanent loss of member data. In a scenario where the architectural decision to maintain external backups had not been made, the incident would have been catastrophic and likely irreversible. The joint statement issued by UniSuper's CEO and Google Cloud's CEO is explicit on this point — and the fact that a joint statement was issued is, in itself, significant. Cloud providers rarely publicly admit operational errors with this level of detail.

What Should Exist: Structural Safeguards Against Catastrophic Deletion

This incident is not about an operator who made a mistake. It is about a system that allowed an operator mistake to have unlimited blast radius. Architecting against this type of failure requires controls at multiple layers:

At the provider level (what Google Cloud should have had):

  • Structural and mandatory differentiation between trial and production parameters, with explicit validation of subscription type before any expiration configuration.
  • Proactive customer notification when any destructive parameter is configured, with a confirmation window and grace period.
  • Post-deletion retention period independent of the subscription — data in soft-delete for at least 30 days after any subscription-scale deletion operation.
  • Multi-party approval for operations affecting resources across multiple regions simultaneously.
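
The provider-level controls listed above can be sketched as a gate in front of any subscription-scale deletion. This is a hypothetical model, not Google Cloud's actual implementation: the `DeletionRequest` type, the two-approver threshold, and the function names are assumptions; only the 30-day soft-delete window comes from the list above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

SOFT_DELETE_WINDOW = timedelta(days=30)  # retention period suggested in the text
REQUIRED_APPROVERS = 2                   # assumed threshold for multi-party approval


@dataclass
class DeletionRequest:
    subscription_id: str
    environment: str                 # "trial" or "production"
    regions: list[str]
    customer_notified: bool = False
    approvers: set[str] = field(default_factory=set)


def blocking_reasons(request: DeletionRequest) -> list[str]:
    """List every safeguard that still blocks the deletion (empty means allowed)."""
    reasons = []
    is_production = request.environment == "production"
    multi_region = len(request.regions) > 1
    if is_production and not request.customer_notified:
        reasons.append("customer has not been notified")
    if (is_production or multi_region) and len(request.approvers) < REQUIRED_APPROVERS:
        reasons.append(f"needs {REQUIRED_APPROVERS} distinct approvers")
    return reasons


def schedule_deletion(request: DeletionRequest, now: datetime) -> datetime:
    """Return the earliest hard-delete time, or raise if a safeguard blocks it."""
    reasons = blocking_reasons(request)
    if reasons:
        raise PermissionError("; ".join(reasons))
    # Resources are only soft-deleted now; the purge happens after the window.
    return now + SOFT_DELETE_WINDOW


request = DeletionRequest("example-subscription", "production", ["region-a", "region-b"])
print(blocking_reasons(request))
```

The design choice worth noting is that the gate keys on the declared environment and the scope of the operation, so a single-region trial can still expire automatically while a production subscription cannot.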

At the customer level (what any critical organization should have):

  • Backups outside the primary provider — not just outside the region, but outside the provider. Separate credentials, separate control plane, no dependency on the primary provider for recovery.
  • Periodic full restoration tests from external backups — not just existence verification, but DR exercises that validate recovery time and integrity.
  • Regular inventory and audit of subscription configurations, including lifecycle parameters that may have been set during provisioning.
  • Clear RTO and RPO definition for the total provider loss scenario — not just for zone or region failures.
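
RTO and RPO targets for the provider-loss scenario only mean something if they are checked against measurements. The following is a minimal sketch under assumed inputs: the system names, timestamps, and the 24-hour RPO are invented for illustration.

```python
from datetime import datetime, timedelta


def rpo_breaches(last_backup_at: dict[str, datetime], rpo: timedelta, now: datetime) -> dict[str, timedelta]:
    """Return systems whose newest external backup is older than the RPO.

    Values are the amount by which each system exceeds the objective.
    """
    breaches = {}
    for system, finished_at in last_backup_at.items():
        age = now - finished_at
        if age > rpo:
            breaches[system] = age - rpo
    return breaches


def rto_met(restore_started: datetime, restore_finished: datetime, rto: timedelta) -> bool:
    """Check a measured DR exercise against the recovery time objective."""
    return restore_finished - restore_started <= rto


now = datetime(2024, 5, 10, 12, 0)
last_backup_at = {
    "member-portal": datetime(2024, 5, 10, 6, 0),
    "ledger": datetime(2024, 5, 8, 12, 0),
}
print(rpo_breaches(last_backup_at, rpo=timedelta(hours=24), now=now))
```

In this example only "ledger" is reported, because its newest external copy is 48 hours old. Feeding `rto_met` with timings from a real full-restore exercise is what turns the RTO from a stated goal into a tested one.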

At the architectural level:

  • Separate the identity and access control plane from data resources. In public clouds, a compromised or deleted subscription or organization can take all resources with it. Consider patterns where critical data has protections that survive deletion of the organizational container.
  • Implement deletion locks (such as a project lien that restricts the resourcemanager.projects.delete permission in GCP, or DeletionPolicy: Retain in AWS CloudFormation) on all critical production resources. These mechanisms exist; they need to be used systematically, not optionally.
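
Using deletion locks systematically implies checking for them automatically. As one hedged illustration, the sketch below scans a CloudFormation template that has already been parsed into a dictionary and reports critical resources without a protective DeletionPolicy. The function name, the choice of critical types, and the sample template are assumptions; treating a missing attribute as "Delete" is a deliberately conservative simplification.

```python
PROTECTIVE_POLICIES = {"Retain", "Snapshot"}


def unprotected_resources(template: dict, critical_types: set[str]) -> list[str]:
    """Return logical IDs of critical resources lacking a protective DeletionPolicy."""
    findings = []
    for logical_id, resource in template.get("Resources", {}).items():
        if resource.get("Type") not in critical_types:
            continue
        # A missing attribute is treated as "Delete" (conservative assumption).
        if resource.get("DeletionPolicy", "Delete") not in PROTECTIVE_POLICIES:
            findings.append(logical_id)
    return sorted(findings)


template = {
    "Resources": {
        "MemberData": {"Type": "AWS::S3::Bucket", "DeletionPolicy": "Retain"},
        "Ledger": {"Type": "AWS::RDS::DBInstance"},
        "Queue": {"Type": "AWS::SQS::Queue"},
    }
}
print(unprotected_resources(template, {"AWS::S3::Bucket", "AWS::RDS::DBInstance"}))  # ['Ledger']
```

Run in a deployment pipeline, a check like this makes the lock a default that must be consciously waived rather than an option that must be remembered.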

The irony of this case is that UniSuper did the right thing — it maintained external backups. But that was an individual decision by a specific organization, not a structural guarantee of the ecosystem. For every UniSuper that made that decision, how many organizations assume that geographic redundancy within the provider is sufficient?

Technical Lessons

Geographic redundancy within the same provider does not protect against control plane or subscription failures. Two regions tied to the same subscription share the same blast radius for operations executed at the subscription level.
Backups outside the primary provider are a requirement, not an option, for critical systems. 'Outside the provider' means separate credentials, separate control plane, no operational dependency on the primary provider.
Deletion locks must be applied systematically to production resources. GCP, AWS, and Azure all offer mechanisms to prevent accidental deletion. They should be part of the security baseline, not optional configurations.
Destructive operations at subscription or organization scale need safeguards proportional to impact. Multi-party confirmation, customer notification, grace period, and post-deletion soft-delete are necessary controls, not optional ones.
Test your DR regularly and for the right scenario. Testing zone restoration is different from testing full provider loss restoration. The second scenario is what mattered here.
Audit subscription lifecycle configurations periodically. Parameters configured during provisioning — especially by third parties — may contain destructive settings not visible in day-to-day operations.
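
The last lesson, auditing lifecycle configurations, can be sketched as a scan over an exported configuration. Everything here is hypothetical: the key pattern, the sample configuration, and the field names such as "expireTime" are invented for illustration and do not reflect the actual GCVE configuration schema.

```python
import re

# Hypothetical key fragments that suggest a lifecycle or expiry setting.
SUSPICIOUS_KEY = re.compile(r"(expir|ttl|trial|term|auto.?delete|retention)", re.IGNORECASE)


def lifecycle_findings(config: dict, prefix: str = "") -> list[tuple[str, object]]:
    """Walk a nested configuration export and list lifecycle-related settings."""
    findings = []
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            findings.extend(lifecycle_findings(value, path))
        elif SUSPICIOUS_KEY.search(key):
            findings.append((path, value))
    return findings


config = {
    "name": "prod-private-cloud",
    "network": {"cidr": "10.0.0.0/22"},
    "lifecycle": {"expireTime": "2025-01-01T00:00:00Z", "autoDelete": True},
}
for path, value in lifecycle_findings(config):
    print(path, "=", value)
```

The pattern is intentionally broad and will produce false positives; for an audit of destructive settings, reviewing a few extra fields is cheaper than missing one. The check is also limited to parameters the customer can actually export, so settings held only on the provider side remain outside its reach.
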
My Senior Take
Senior Solutions Architect

I have worked with financial systems for over 16 years and this incident bothers me in a specific way: not because of the error itself, but because of the absence of controls that should be obvious in any platform hosting production data for a regulated financial institution.

What concerns me more than the Google operator's error is the implicit trust architecture that most organizations have with their cloud providers. When you place your backups with the same provider as your primary data, even in different regions, even in different accounts, you are assuming the provider will never make an error that broadly affects the control plane. That is an assumption this incident proves to be false.

My position is direct: for any system where data loss is unacceptable, the backup strategy must include at least one copy completely outside the primary provider, with access managed by credentials that do not depend on the primary provider for authentication or authorization. This is not paranoia; it is the reasonable minimum given what we know about control plane blast radius.

There is also a lesson about automation that is frequently underestimated: automation is faithful, not intelligent. It executes what it was configured to execute, without considering context, without questioning intent, without evaluating consequences. The work of limiting automation blast radius belongs to the architect, not to the automation. Deletion locks, multi-party approvals, grace periods, and proactive notifications are the architectural equivalent of 'are you sure?', and they need to be defaults, not options.

Finally: UniSuper made the right decision to maintain external backups. But that decision was made despite the ecosystem, not because of it. What we need is an ecosystem where the right decision is the path of least resistance, where the default configuration of any production subscription includes protections against accidental deletion, not one where the customer needs to discover and configure this manually.

Verdict

The 2024 UniSuper-Google Cloud incident is one of the most important cloud resilience cases of the last decade, not because it was the largest in scale, but because it exposes a foundational assumption failure that is endemic in the industry: the belief that redundancy within a provider is equivalent to resilience against the provider. UniSuper survived because it made an architectural decision that, at the time, may have seemed overly conservative to some: maintaining backups completely outside of Google Cloud. That decision, and not any Google Cloud capability, was what prevented permanent data loss for 500,000 superannuation fund members.

The implications are direct for any critical systems architect.

What this case proves:

  • Geographic redundancy within a single provider does not protect against control plane failures.
  • Backups co-located with primary data at the same provider are not independent backups.
  • Automation without safeguards proportional to impact is a systemic risk, not an operational efficiency.

What this case recommends:

  • Backups outside the primary provider as a non-negotiable requirement for critical data.
  • Deletion locks as defaults on production resources.
  • DR tests for the total provider loss scenario, not just zone failures.
  • Periodic audits of subscription lifecycle configurations.

What this case leaves open: How many organizations are operating today with the same exposure UniSuper had, without the external backups that saved it? The honest answer is: probably most of them. And that is the true legacy of this incident.

#postmortem #google-cloud #resilience #backup #gcve #accidental-deletion #multi-cloud #disaster-recovery
Written with AI assistance from the public case and my architect's reading.