For the second time in the past couple of weeks, Google Compute Engine has battled a problem causing bouts of unavailability for customers running workloads on its virtual machines.
The problem first appeared around the third week of February, when CloudWedge reported on the original issues. GCE staff quickly addressed the network egress problems and issued an apology. Symptoms of the network egress issue seemed to subside for a couple of weeks, until the problem flared up again over this past weekend.
As staff deployed patches over the weekend, they noticed that the network egress issue had reappeared. Google notes that many clients did not experience any problems, while some saw virtual machine unavailability and others reported slow connections to their VMs. Google says the latest round of network egress issues began on March 7, 2015, at 9:55 AM PST. Only 43 minutes later, Google announced that the issue had been remediated.
On Google’s Compute Engine blog, the root cause of the packet loss was described as, “A configuration change introduced to the network stack designed to provide greater isolation between VMs and projects by capping the traffic volume allowed by an individual VM. The configuration change had been tested prior to deployment to production without incident. However as it was introduced into the production environment it affected some VMs in an unexpected manner.”
According to comments on the GCE blog page, Google engineers are still investigating why the change behaved as expected in the test environment but not in production. Google followed up its root-cause analysis with a policy change to its patching process.
Google writes, “Future changes will not be applied to production until the test suite has been improved to demonstrate parity with behavior observed in production during this incident.”