The Google Compute Engine experienced brief downtime on Thursday. Google says the outage lasted a total of 160 minutes. A full summary of the problem has been posted to Google’s blog. As of Friday morning, the Google Compute Engine appears to be running normally.
Google says the issue left affected GCE instances unreachable for 160 minutes. The company has remediated the problem and taken steps to ensure that this specific problem doesn’t happen again.
What Exactly Happened?
Google engineers noticed a 10% loss of outbound traffic from Compute Engine starting at 22:40 PST on February 18th. The loss spiked to 70% at 23:55. That level of loss persisted for the next 40 minutes, and the impacted virtual machines were unable to connect to resources outside of their own virtual networks. During this time, Google engineers were scrambling to remediate the issue.
At 00:35 PST on Thursday, the remediation tactics Google implemented began to work. The loss dropped to 15% just 15 minutes later. By 01:20, Google says, all impacted services were functioning normally. Google stresses that the impacted virtual machines experienced only network issues; the instances themselves did not stop running. They were simply unable to talk to any devices outside of their private networks.
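The reported timeline can be checked with a little arithmetic. The sketch below uses only the timestamps quoted above; the year is not stated in this article, so `PLACEHOLDER_YEAR` is an assumption that affects nothing but the date objects, and the variable names are illustrative.

```python
from datetime import datetime

# The article gives no year, so this value is a placeholder (assumption).
# Only the differences between timestamps matter here.
PLACEHOLDER_YEAR = 2000

# Timestamps as reported in the article, all in PST.
loss_first_seen = datetime(PLACEHOLDER_YEAR, 2, 18, 22, 40)  # 10% loss
loss_peaked = datetime(PLACEHOLDER_YEAR, 2, 18, 23, 55)      # 70% loss
fix_took_effect = datetime(PLACEHOLDER_YEAR, 2, 19, 0, 35)   # remediation begins to work
loss_at_15_pct = datetime(PLACEHOLDER_YEAR, 2, 19, 0, 50)    # loss down to 15%
fully_recovered = datetime(PLACEHOLDER_YEAR, 2, 19, 1, 20)   # services normal


def minutes_between(start: datetime, end: datetime) -> int:
    """Return the whole number of minutes from start to end."""
    return int((end - start).total_seconds() // 60)


total_outage = minutes_between(loss_first_seen, fully_recovered)
peak_duration = minutes_between(loss_peaked, fix_took_effect)
drop_to_15_pct = minutes_between(fix_took_effect, loss_at_15_pct)

print(total_outage)    # 160
print(peak_duration)   # 40
print(drop_to_15_pct)  # 15
```

The 160-minute total matches the figure Google gave, and the 40-minute window at peak loss matches the span between 23:55 and 00:35.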
Cached Routes Expired – Packet Loss Detected
Google believes that its system began dropping cached routes prematurely, which contributed to the outage. At the same time, the system stopped updating routing information. Google engineers fixed the problem by extending the lease on the route caches from a couple of hours to several weeks. They also needed to force a reload on the system. The combination of these two fixes is expected to remediate the issue. Google is also taking additional steps to ensure this issue never happens again, and has promised its users that a full postmortem on this specific outage will be published next week.
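To illustrate why a longer lease masks a stalled update process, here is a toy model of a lease-based route cache. It is not Google's implementation: the `RouteCache` class, the addresses, and the specific lease values (2 hours and 3 weeks, standing in for "a couple of hours" and "several weeks") are all assumptions made for the sketch.

```python
HOUR = 3600
WEEK = 7 * 24 * HOUR


class RouteCache:
    """Toy cache: an entry is usable until its lease runs out after the last refresh."""

    def __init__(self, lease_seconds: int) -> None:
        self.lease_seconds = lease_seconds
        self._entries = {}  # destination -> (next_hop, time of last refresh)

    def refresh(self, destination: str, next_hop: str, now: int) -> None:
        """Store or renew a route, restarting its lease at time `now`."""
        self._entries[destination] = (next_hop, now)

    def lookup(self, destination: str, now: int):
        """Return the next hop, or None if the route is missing or its lease expired."""
        entry = self._entries.get(destination)
        if entry is None:
            return None
        next_hop, refreshed_at = entry
        if now - refreshed_at > self.lease_seconds:
            return None  # expired lease: traffic to this destination is dropped
        return next_hop


# Hypothetical scenario: routes are refreshed once at t=0, then updates stall.
short_lease = RouteCache(lease_seconds=2 * HOUR)  # assumed "couple of hours"
long_lease = RouteCache(lease_seconds=3 * WEEK)   # assumed "several weeks"
for cache in (short_lease, long_lease):
    cache.refresh("198.51.100.0/24", "gateway-a", now=0)

three_hours_later = 3 * HOUR
print(short_lease.lookup("198.51.100.0/24", now=three_hours_later))  # None
print(long_lease.lookup("198.51.100.0/24", now=three_hours_later))   # gateway-a
```

In this model, extending the lease does not restart the stalled updates; it only keeps existing routes usable for longer, which is presumably why the fix was paired with a forced reload.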
Google left a note for impacted users on its blog saying, “We consider GCE’s availability over the last 24 hours to be unacceptable, and we apologize if your service was affected by this outage.”