Google Cloud MAJOR
Managed Instance Groups that rely on health checks to restart unhealthy VMs may remain in an unhealthy state
April 21, 2023 · 10:05 AM UTC – 11:13 PM UTC · Duration: 13h 8min
Affected Services
Google Compute Engine
Timeline
08:01 PM
Mini Incident Report
We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support.
(All Times US/Pacific)
Incident Start: 21 April 2023 02:05
Incident End: 21 April 2023 15:13
Duration: 13 hours, 8 minutes
Affected Services and Features:
Google Compute Engine Control Plane
Regions/Zones: Global
Description:
For a duration of 13 hours and 8 minutes, Google Compute Engine Control Plane health checks failed for newly added health checks and for any changes made to existing health checks. Preliminary analysis indicates that a recent network configuration change caused the issue.
Customer Impact:
During the incident the following GCE actions failed:
Any activity, including Instance Group operations, that reuses an existing health check and directs it to a new or different Virtual Machine instance.
Altering an existing health check.
Calling gcloud compute instance-groups managed wait-until --stable on a newly created instance group with an existing health check would fail or time out, as the Instance Group Manager would likely not reach a stable state.
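For automation that calls wait-until --stable, a bounded wait can keep a pipeline from hanging when a group never stabilizes, as described above. The following is a minimal sketch, not an official recommendation. It assumes the gcloud CLI is installed, that wait-until accepts --timeout (in seconds), and that the command exits non-zero when the group does not become stable in time. The function name wait_until_stable and the injectable run parameter are illustrative.

```python
import subprocess


def wait_until_stable(group, zone, timeout_s=300, run=subprocess.run):
    """Return True if the group reports stable within timeout_s, else False.

    `run` is injectable so the function can be exercised without gcloud.
    """
    cmd = [
        "gcloud", "compute", "instance-groups", "managed", "wait-until",
        group, "--stable", f"--zone={zone}", f"--timeout={timeout_s}",
    ]
    try:
        # Local guard, slightly longer than gcloud's own timeout,
        # in case the CLI process itself hangs.
        result = run(cmd, capture_output=True, text=True, timeout=timeout_s + 30)
    except subprocess.TimeoutExpired:
        return False
    # Assumption: a non-zero exit code means the group did not stabilize.
    return result.returncode == 0
```

A caller could treat a False result as a signal to stop the rollout and alert an operator rather than retrying indefinitely.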
11:18 PM
The issue with Google Compute Engine has been resolved for all affected projects as of Friday, 2023-04-21 15:13 US/Pacific.
We thank you for your patience while we worked on resolving the issue.
08:29 PM
Summary: Managed Instance Groups that rely on health checks to restart unhealthy VMs may remain in an unhealthy state
Description: Mitigation work by our engineering team is taking longer than expected and is still underway.
The mitigation is expected to complete by Friday, 2023-04-21 15:00 US/Pacific.
We will provide more information by Friday, 2023-04-21 15:30 US/Pacific.
Diagnosis: Customers impacted: any customer that uses Managed Instance Groups.
Symptoms: Managed Instance Groups that rely on health checks to restart unhealthy VMs may remain in an unhealthy state.
The issue should not affect running VMs. It should occur only when a VM becomes unhealthy, in which case the automation that would normally repair it may not work.
Workaround: None at this time.
03:34 PM
Summary: Managed Instance Groups that rely on health checks to restart unhealthy VMs may remain in an unhealthy state
Description: Mitigation work by our engineering team is taking longer than expected and is still underway.
The mitigation is expected to complete by Friday, 2023-04-21 12:30 US/Pacific.
We will provide more information by Friday, 2023-04-21 13:00 US/Pacific.
Diagnosis: Customers impacted: any customer that uses Managed Instance Groups.
Symptoms: Managed Instance Groups that rely on health checks to restart unhealthy VMs may remain in an unhealthy state.
The issue should not affect running VMs. It should occur only when a VM becomes unhealthy, in which case the automation that would normally repair it may not work.
Workaround: None at this time.
02:03 PM
Summary: Managed Instance Groups that rely on health checks to restart unhealthy VMs may remain in an unhealthy state
Description: Mitigation work by our engineering team is still underway.
We will provide more information by Friday, 2023-04-21 08:15 US/Pacific.
Diagnosis: Customers impacted: any customer that uses Managed Instance Groups.
Symptoms: Managed Instance Groups that rely on health checks to restart unhealthy VMs may remain in an unhealthy state.
The issue should not affect running VMs. It should occur only when a VM becomes unhealthy, in which case the automation that would normally repair it may not work.
Workaround: None at this time.
01:05 PM
Summary: Managed Instance Groups that rely on health checks to restart unhealthy VMs may remain in an unhealthy state
Description: Mitigation work by our engineering team is currently underway.
We do not have an ETA for mitigation at this point.
We will provide more information by Friday, 2023-04-21 06:15 US/Pacific.
Diagnosis: Customers impacted: any customer that uses Managed Instance Groups.
Symptoms: Managed Instance Groups that rely on health checks to restart unhealthy VMs may remain in an unhealthy state.
The issue should not affect running VMs. It should occur only when a VM becomes unhealthy, in which case the automation that would normally repair it may not work.
Workaround: None at this time.
12:06 PM
Summary: Managed Instance Groups that rely on health checks to restart unhealthy VMs may remain in an unhealthy state
Description: We are experiencing an issue with Google Compute Engine beginning at Friday, 2023-04-21 02:00 US/Pacific.
Our engineering team is still investigating the issue and working on a mitigation plan.
We will provide an update by Friday, 2023-04-21 05:15 US/Pacific with current details.
Diagnosis: Customers impacted: any customer that uses Managed Instance Groups in europe-west12-a, europe-west12-b, or europe-west12-c.
Symptoms: Managed Instance Groups that rely on health checks to restart unhealthy VMs may remain in an unhealthy state.
The issue should not affect running VMs. It should occur only when a VM becomes unhealthy, in which case the automation that would normally repair it may not work.
Workaround: None at this time.
11:05 AM
Summary: Managed Instance Groups that rely on health checks to restart unhealthy VMs may remain in an unhealthy state
Description: We are experiencing an issue with Google Compute Engine beginning at Friday, 2023-04-21 02:00 US/Pacific.
Our engineering team continues to investigate the issue.
We will provide an update by Friday, 2023-04-21 04:15 US/Pacific with current details.
We apologize to all who are affected by the disruption.
Diagnosis: Customers impacted: any customer that uses Managed Instance Groups and relies on health checks to keep the instances in a healthy state.
Symptoms: Managed Instance Groups that rely on health checks to restart unhealthy VMs may remain in an unhealthy state.
The issue should not affect running VMs. It should occur only when a VM becomes unhealthy, in which case the automation that would normally repair it may not work.
Workaround: None at this time.
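Because the symptom in the updates above is a VM that stays unhealthy without being repaired, operators may want to list which instances in a group currently report a non-healthy state. The following is a rough sketch under stated assumptions, not an official diagnostic. It assumes the input is the JSON produced by gcloud compute instance-groups managed list-instances with --format=json, and that each entry carries an instance URL plus an instanceHealth list with a detailedHealthState field. The function name unhealthy_instances is illustrative.

```python
import json


def unhealthy_instances(list_instances_json):
    """Return names of instances whose reported health state is not HEALTHY."""
    flagged = []
    for item in json.loads(list_instances_json):
        states = [
            health.get("detailedHealthState")
            for health in item.get("instanceHealth", [])
        ]
        # Instances that report no health state at all are skipped here;
        # a group without a health check configured reports none.
        if states and any(state != "HEALTHY" for state in states):
            # Assumption: "instance" is a resource URL ending in the VM name.
            flagged.append(item["instance"].rsplit("/", 1)[-1])
    return flagged
```

The returned names could then be reviewed manually, which may be useful when the automated repair described in the symptoms is not acting.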