What went wrong with UniSuper and Google Cloud?

Update 2024-05-21: Miles Ward has an update with more details in this X thread:

UniSuper’s production Google Cloud VMware Engine (GCVE) private cloud was automatically deleted one year after its creation because of a misconfiguration in how it was provisioned. A bug in the creation script passed a null value, which caused the private cloud to be created with a one-year subscription rather than a perpetual one. After the year elapsed, Google Cloud dutifully deleted the private cloud.

The rest of this post is still relevant for more context, but it is now clear that this was indeed a Google Cloud bug. I remain puzzled why the communication over this incident was so bad. By their silence, Google Cloud let millions of people think that they had deleted a company’s entire Google Cloud environment. I’m also confused why this information came out via a non-employee on Twitter rather than through official channels.


Update 2: Google Cloud has published a blog post explaining exactly what happened. The details do bear out that this was entirely a Google Cloud issue from start to finish, with no fault on the part of UniSuper.

Over the past two weeks, UniSuper, an Australian superannuation fund, faced every cloud user’s nightmare: their cloud provider deleted their data. From May 2nd to 13th, UniSuper experienced a major outage and in several updates blamed Google Cloud for the incident.

The disruption of UniSuper services was caused by a combination of rare issues at Google Cloud that resulted in an inadvertent misconfiguration during the provisioning of UniSuper’s Private Cloud, which triggered a previously unknown software bug that impacted UniSuper’s systems. This was an unprecedented occurrence, and measures have been taken to ensure this issue does not happen again.

This culminated in a joint press release, co-signed by Google Cloud CEO Thomas Kurian.

Google Cloud CEO, Thomas Kurian has confirmed that the disruption arose from an unprecedented sequence of events whereby an inadvertent misconfiguration during provisioning of UniSuper’s Private Cloud services ultimately resulted in the deletion of UniSuper’s Private Cloud subscription.

This is an isolated, ‘one-of-a-kind occurrence’ that has never before occurred with any of Google Cloud’s clients globally. This should not have happened. Google Cloud has identified the events that led to this disruption and taken measures to ensure this does not happen again.

UniSuper had duplication in two geographies as a protection against outages and loss. However, when the deletion of UniSuper’s Private Cloud subscription occurred, it caused deletion across both of these geographies.

Restoring UniSuper’s Private Cloud instance has called for an incredible amount of focus, effort, and partnership between our teams to enable an extensive recovery of all the core systems. The dedication and collaboration between UniSuper and Google Cloud has led to an extensive recovery of our Private Cloud which includes hundreds of virtual machines, databases and applications.

The press release uses vague language, obscuring the technical details of what happened. Given the lack of precision, I assumed this was written solely by UniSuper. Gergely Orosz checked with Google Cloud, and they confirmed that this was an official joint statement.

Given the few details provided, I did more research to try to understand what exactly happened.

What was deleted?

Google Cloud doesn’t have a resource called a “Subscription”; the closest concept would probably be a “Billing Account.” While Google has a Virtual Private Cloud (VPC), you wouldn’t describe it as an instance, and deleting a VPC doesn’t seem like it would have the drastic effects described.

However, if you research UniSuper’s Google Cloud migration, you will see that they use Google Cloud VMware Engine (GCVE). From June 2023:

The superannuation fund has shifted almost all non-production workloads out of its data centres and into Google Cloud so far.

UniSuper used the Google VMware Engine (GCVE) managed service and engaged partner Kasna to assist with the migration.

GCVE has a resource called a private cloud. A private cloud contains the hosts, management server, storage, and networking for a VMware stack.

Let’s look at the API method for deleting a private cloud:

A PrivateCloud resource scheduled for deletion has PrivateCloud.state set to DELETED and expireTime set to the time when deletion is final and can no longer be reversed. The delete operation is marked as done as soon as the PrivateCloud is successfully scheduled for deletion (this also applies when delayHours is set to zero), and the operation is not kept in pending state until PrivateCloud is purged. PrivateCloud can be restored using privateClouds.undelete method before the expireTime elapses. When expireTime is reached, deletion is final and all private cloud resources are irreversibly removed and billing stops.

All private cloud resources being irreversibly removed sounds a lot like the outage that UniSuper had. Some people have said that Google deleted their entire Google Cloud account. This seems less likely to me because Google has a number of safeguards and delays when deleting projects:

Warning: You can recover most resources if you restore a project within the 30-day period. Some services have delays in restoring and you might need to wait some time for services to be restored. Some resources, such as Cloud Storage or Pub/Sub resources, are deleted much sooner. These resources might not be fully recoverable even if you restore the project within the 30-day period.

Update: User smcwhtdtmc on HN pointed out the Terraform resource for vmwareengine_private_cloud hard-codes the delayHours parameter to 0, i.e. immediate deletion. All private cloud resources being immediately irreversibly deleted sounds a lot like the outage that UniSuper had.
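
To make the deletion mechanics concrete, here is a minimal sketch of a GCVE private cloud defined with that Terraform resource. This is illustrative only: the names, zone, and CIDR are invented, and the field names reflect my reading of the provider docs, so they may not be exact.

```hcl
# Hypothetical example; all names and values are invented.
resource "google_vmwareengine_network" "example" {
  name     = "example-vmware-network"
  location = "global"
  type     = "STANDARD"
}

resource "google_vmwareengine_private_cloud" "example" {
  name     = "example-private-cloud"
  location = "australia-southeast1-a" # a standard private cloud lives in a single zone

  network_config {
    management_cidr       = "192.168.30.0/24"
    vmware_engine_network = google_vmwareengine_network.example.id
  }

  management_cluster {
    cluster_id = "example-management-cluster"
    node_type_configs {
      node_type_id = "standard-72"
      node_count   = 3 # minimum for a standard (non-trial) private cloud
    }
  }
}
```

If the HN comment is right, running `terraform destroy` against a resource like this calls the delete API with delayHours set to 0, so there is no window in which privateClouds.undelete could save you.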

How did everything get deleted?

UniSuper’s press release says:

UniSuper had duplication in two geographies as a protection against outages and loss. However, when the deletion of UniSuper’s Private Cloud subscription occurred, it caused deletion across both of these geographies.

Again, the language here is frustratingly vague and passive but implies that Google Cloud deleted multiple independent private clouds in separate regions. Google Cloud doesn’t have a “geography”; it has zones and regions. At first read, it sounds like they are describing a multi-region setup. Google Cloud has two Australian regions, Sydney and Melbourne, which would make sense.

Looking closer at the docs, though, GCVE offers two kinds of private clouds: a standard private cloud hosted in a single zone or a “stretched private cloud”. A stretched private cloud runs in a single region across two zones, with a third zone as a witness zone for failover. A close reading of the press release doesn’t rule out UniSuper having a single stretched private cloud running in a single region.

So we either have a single stretched private cloud being deleted or two separate private clouds being deleted. A single stretched private cloud seems more likely to me, but I can’t be definitive.
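
To make the two readings concrete, here is a hedged Terraform-style sketch of each topology. The zone and region names are real Google Cloud locations, but the resource names are invented, the required blocks are elided, and the `type` field for a stretched private cloud is my recollection of the API surface, so treat it as an assumption.

```hcl
# Reading 1: two independent standard private clouds, one per region
# (Sydney and Melbourne); required blocks omitted for brevity.
resource "google_vmwareengine_private_cloud" "sydney" {
  name     = "example-pc-syd" # hypothetical name
  location = "australia-southeast1-a"
  # network_config and management_cluster omitted ...
}

resource "google_vmwareengine_private_cloud" "melbourne" {
  name     = "example-pc-mel" # hypothetical name
  location = "australia-southeast2-a"
  # network_config and management_cluster omitted ...
}

# Reading 2: a single stretched private cloud in one region, spread
# across two zones with a third witness zone managed by the service.
resource "google_vmwareengine_private_cloud" "stretched" {
  name     = "example-pc"           # hypothetical name
  location = "australia-southeast1" # stretched private clouds are regional (assumption)
  type     = "STRETCHED"            # assumption: how the provider models it
  # network_config and management_cluster omitted ...
}
```

In reading 1, deleting both private clouds would take two separate delete calls (or one destroy of a configuration containing both); in reading 2, a single delete removes everything.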

How was it deleted?

So we come to the big question: How were the private cloud(s) deleted? The press release makes heroic use of the passive voice to obscure the actors: “an unprecedented sequence of events whereby an inadvertent misconfiguration during provisioning of UniSuper’s Private Cloud services ultimately resulted in the deletion of UniSuper’s Private Cloud subscription.”

Update 2024-05-21: Miles Ward has the details on what actually happened in this X thread. My guesses were close, but not quite right.

Based on my experiences with Google Cloud’s professional services team, they, and presumably their partners, recommend Terraform for defining infrastructure as code. This leads to several possible interpretations of this sentence:

  1. UniSuper ran a terraform apply with Terraform code that was “misconfigured”. This triggered a bug in Google Cloud, and Google Cloud accidentally deleted the private cloud.

This is what UniSuper has implied or stated throughout the outage.

  2. UniSuper ran a terraform apply with a bad configuration or perhaps a terraform destroy with the prod tfvar file. The Terraform plan showed “delete private cloud,” and the operator approved it.

Automation errors like this happen every day, although they aren’t usually this catastrophic; a sketch of a standard guardrail against them follows this list. This seems more plausible to me than a rare one-in-a-million bug that only affected UniSuper.

  3. UniSuper ran a buggy automation script provided by Google Cloud’s professional services team. A misconfiguration caused the script to go off the rails. The operator was asked whether to delete the production private cloud, and they said yes.

I find this less plausible, but it is one way to reconcile Google Cloud accepting fault with what sounds like a customer error in automation.
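
If scenario 2 is close to the truth, it is worth noting that Terraform ships a guardrail for exactly this failure mode. A minimal sketch, reusing the illustrative resource from above (field names and values remain hypothetical):

```hcl
resource "google_vmwareengine_private_cloud" "example" {
  name     = "example-private-cloud"
  location = "australia-southeast1-a"

  # network_config and management_cluster blocks as in the earlier sketch.

  lifecycle {
    # With this flag set, Terraform refuses to plan or apply any change
    # that would destroy the resource, turning an accidental
    # `terraform destroy` or a misconfigured apply into a hard error
    # instead of a deleted private cloud.
    prevent_destroy = true
  }
}
```

This would not protect against a provider-side bug or a deletion initiated outside Terraform, but it blunts the most common “wrong tfvars” class of mistake.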

Maybe it was a Google Cloud bug?

There are a few holes in my theory that this was primarily UniSuper’s fault:

  1. On Tuesday, 7 May, UniSuper attributed a statement to Google Cloud in which Google Cloud admitted fault for the outage:

The disruption of UniSuper services was caused by a combination of rare issues at Google Cloud that resulted in an inadvertent misconfiguration during the provisioning of UniSuper’s Private Cloud, which triggered a previously unknown software bug that impacted UniSuper’s systems. This was an unprecedented occurrence, and measures have been taken to ensure this issue does not happen again.

  2. Thomas Kurian (apparently) signed off on a joint statement with UniSuper. This statement is less accusatory than UniSuper’s other statements but doesn’t clarify which party was at fault.

  3. Google Cloud hasn’t released any pushback to the story, either directly or through proxies in the media.

Why would Google Cloud not respond to this story if it was UniSuper’s fault? It’s hard to say, but putting out a competing statement blaming or contradicting your customer is a bad look with that customer and with all future customers.

Conclusion

Given how little detail was communicated, it is difficult to make a conclusive statement about what happened, though I personally suspect UniSuper operator error was a large factor. Hopefully, APRA, Australia’s superannuation regulator, will investigate further and release a public report with more details.