Earlier this week, the Google Cloud Platform blog reported an incident that caused a 21-hour, 38-minute outage affecting administrators who tried to create an external load balancer for their applications.
The Google Cloud Platform blog notes that the outage occurred between December 7th and 8th, returning a 400 invalid argument error to users who attempted to deploy load balancing infrastructure in the cloud. Google estimates that 6.7% of its users’ clusters may have been impacted by this outage.
So what’s the root cause of this issue?
It appears that an update was released to the Google Cloud Platform. Google described the update as a “Minor” upgrade, but it inflicted major pain: the outage was widely reported throughout the cloud blogosphere.
More about the Root Cause:
It’s refreshing to see Google Cloud Platform own up to its errors.
While the outage itself is discouraging, it is encouraging to have a cloud provider that is so open and transparent about the inner workings of its product.
In an explanation concerning the incident, the Google Cloud Platform status blog says:
“The Compute Engine API inadvertently changed the case-sensitivity of the “sessionAffinity” enum variable in the target pool definition, and this variation was not covered by testing.”
The blog goes on to say: “Google Container Engine was not aware of this change and sent requests with incompatible case, causing the Compute Engine API to return an error status.”
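To make the failure mode concrete, here is a minimal Python sketch of how a change in enum case-sensitivity can turn a previously accepted request into a 400 error. It is an illustration only, not Google's code: the function names, the enum values, and the mixed-case value sent by the client are all assumptions, since the incident report does not say which casing each side used.

```python
from enum import Enum


class SessionAffinity(Enum):
    # Illustrative values; the casing used during the incident is not documented here.
    NONE = "NONE"
    CLIENT_IP = "CLIENT_IP"
    CLIENT_IP_PROTO = "CLIENT_IP_PROTO"


def parse_lenient(value: str) -> SessionAffinity:
    """Hypothetical 'before' behavior: case is normalized before lookup."""
    return SessionAffinity(value.upper())


def parse_strict(value: str) -> SessionAffinity:
    """Hypothetical 'after' behavior: the value must match exactly."""
    return SessionAffinity(value)


def create_target_pool(session_affinity: str, parser) -> dict:
    """Simulate an API handler that rejects unknown enum values with a 400."""
    try:
        affinity = parser(session_affinity)
    except ValueError:
        return {"status": 400, "error": "Invalid argument"}
    return {"status": 200, "sessionAffinity": affinity.value}


if __name__ == "__main__":
    # The client keeps sending the same (assumed) mixed-case value throughout.
    print(create_target_pool("None", parse_lenient))
    print(create_target_pool("None", parse_strict))
```

The client never changes in this sketch; only the server's parsing does. A contract test that replays the client's real requests against each new API build is one way to catch this kind of regression before release.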
Mistakes Happen; Always Plan For the Worst
Outages are always teachable moments in the world of IT. Whether it’s human error, human oversight or an act of God, it is important to always plan for the worst.
Taking a multi-cloud approach to deploying IaaS can improve the reliability of your services, because you no longer depend on a single cloud service provider operating optimally.
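As a rough sketch of what that can look like in automation code, the Python snippet below tries a list of providers in order and falls back to the next one when provisioning fails. Everything here is hypothetical: the provider functions are stand-ins rather than real cloud SDK calls, and a production setup would also need health checks, DNS or traffic management, and data replication.

```python
from typing import Callable, Sequence, Tuple


class ProvisioningError(Exception):
    """Raised when a provider cannot create the requested resource."""


# A provider is a (name, create_function) pair; both are placeholders.
Provider = Tuple[str, Callable[[dict], str]]


def provision_with_failover(providers: Sequence[Provider], request: dict) -> Tuple[str, str]:
    """Try each provider in order and return (provider_name, resource_id)."""
    errors = []
    for name, create in providers:
        try:
            return name, create(request)
        except ProvisioningError as exc:
            # Record the failure and move on to the next provider.
            errors.append(f"{name}: {exc}")
    raise ProvisioningError("all providers failed: " + "; ".join(errors))


def primary_cloud(request: dict) -> str:
    # Stand-in for a provider that is currently returning errors.
    raise ProvisioningError("400 invalid argument")


def secondary_cloud(request: dict) -> str:
    # Stand-in for a healthy provider.
    return "lb-" + request["name"]


if __name__ == "__main__":
    chosen = provision_with_failover(
        [("primary", primary_cloud), ("secondary", secondary_cloud)],
        {"name": "web"},
    )
    print(chosen)
```

The design choice worth noting is that the fallback order lives in data rather than code, so adding or reordering providers does not require touching the retry logic.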