Google Cloud MAJOR

Managed Instance Groups that rely on health checks to restart unhealthy VMs may remain in an unhealthy state

April 21, 2023 · 10:05 AM UTC – 11:13 PM UTC · Duration: 13h 8min

Affected Services

Google Compute Engine

Timeline

08:01 PM
Mini Incident Report
We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support.
(All Times US/Pacific)
Incident Start: 21 April 2023 02:05
Incident End: 21 April 2023 15:13
Duration: 13 hours, 8 minutes
Affected Services and Features: Google Compute Engine Control Plane
Regions/Zones: Global
Description: Google Compute Engine Control Plane health checks failed for any changes made to newly added health checks for a duration of 13 hours and 8 minutes. Preliminary analysis showed a recent network configuration change caused the issue.
Customer Impact: During the incident the following GCE actions failed:
Any activity that reuses an existing health check, including Instance Groups, and directs it to a new/different Virtual Machine instance.
Altering an existing health check.
Calling gcloud compute instance-groups managed wait-until --stable on a newly created instance group with an existing health check would fail/timeout as the Instance Group Manager would likely not reach a stable state.
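The wait-until --stable call named in the report can block for a long time when a group never stabilizes. The following is a minimal sketch, assuming Python 3.9+ and the gcloud CLI on PATH, of bounding that wait with a client-side deadline. The function names, the `runner` parameter, and the 600-second default are illustrative assumptions and do not come from Google's report.

```python
import subprocess
from typing import Callable, List


def build_wait_command(group: str, zone: str) -> List[str]:
    """Build the gcloud invocation named in the incident report."""
    return [
        "gcloud", "compute", "instance-groups", "managed",
        "wait-until", group, "--stable", "--zone", zone,
    ]


def wait_until_stable(
    group: str,
    zone: str,
    deadline_s: float = 600.0,  # hypothetical default, tune per workload
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Return True only if gcloud reports a stable group before the deadline."""
    try:
        result = runner(
            build_wait_command(group, zone),
            timeout=deadline_s,  # client-side limit enforced by subprocess
            check=False,         # inspect the return code ourselves
        )
    except subprocess.TimeoutExpired:
        # The group did not stabilize in time, as described in the report.
        return False
    return result.returncode == 0
```

A caller could treat a False result as a signal to alert or skip the deployment step rather than wait indefinitely. The `runner` parameter exists only so the logic can be exercised without invoking gcloud.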
11:18 PM
The issue with Google Compute Engine has been resolved for all affected projects as of Friday, 2023-04-21 15:13 US/Pacific. We thank you for your patience while we worked on resolving the issue.
08:29 PM
Summary: Managed Instance Groups that rely on health checks to restart unhealthy VMs may remain in an unhealthy state
Description: Mitigation work is taking longer than expected and is still underway by our engineering team. The mitigation is expected to complete by Friday, 2023-04-21 15:00 US/Pacific. We will provide more information by Friday, 2023-04-21 15:30 US/Pacific.
Diagnosis:
Customers impacted: any customer that uses Managed Instance Groups.
Symptoms: Managed Instance Groups that rely on health checks to restart unhealthy VMs may remain in an unhealthy state. The issue should not impact running VMs. The issue should occur only if a VM reaches an unhealthy state. The automation that would usually fix it may not be working.
Workaround: None at this time.
03:34 PM
Summary: Managed Instance Groups that rely on health checks to restart unhealthy VMs may remain in an unhealthy state
Description: Mitigation work is taking longer than expected and is still underway by our engineering team. The mitigation is expected to complete by Friday, 2023-04-21 12:30 US/Pacific. We will provide more information by Friday, 2023-04-21 13:00 US/Pacific.
Diagnosis:
Customers impacted: any customer that uses Managed Instance Groups.
Symptoms: Managed Instance Groups that rely on health checks to restart unhealthy VMs may remain in an unhealthy state. The issue should not impact running VMs. The issue should occur only if a VM reaches an unhealthy state. The automation that would usually fix it may not be working.
Workaround: None at this time.
02:03 PM
Summary: Managed Instance Groups that rely on health checks to restart unhealthy VMs may remain in an unhealthy state
Description: Mitigation work is still underway by our engineering team. We will provide more information by Friday, 2023-04-21 08:15 US/Pacific.
Diagnosis:
Customers impacted: any customer that uses Managed Instance Groups.
Symptoms: Managed Instance Groups that rely on health checks to restart unhealthy VMs may remain in an unhealthy state. The issue should not impact running VMs. The issue should occur only if a VM reaches an unhealthy state. The automation that would usually fix it may not be working.
Workaround: None at this time.
01:05 PM
Summary: Managed Instance Groups that rely on health checks to restart unhealthy VMs may remain in an unhealthy state
Description: Mitigation work is currently underway by our engineering team. We do not have an ETA for mitigation at this point. We will provide more information by Friday, 2023-04-21 06:15 US/Pacific.
Diagnosis:
Customers impacted: any customer that uses Managed Instance Groups.
Symptoms: Managed Instance Groups that rely on health checks to restart unhealthy VMs may remain in an unhealthy state. The issue should not impact running VMs. The issue should occur only if a VM reaches an unhealthy state. The automation that would usually fix it may not be working.
Workaround: None at this time.
12:06 PM
Summary: Managed Instance Groups that rely on health checks to restart unhealthy VMs may remain in an unhealthy state
Description: We are experiencing an issue with Google Compute Engine beginning at Friday, 2023-04-21 02:00 US/Pacific. Our engineering team is still investigating the issue and working on a mitigation plan. We will provide an update by Friday, 2023-04-21 05:15 US/Pacific with current details.
Diagnosis:
Customers impacted: any customer that uses Managed Instance Groups in europe-west12-a, europe-west12-b, or europe-west12-c.
Symptoms: Managed Instance Groups that rely on health checks to restart unhealthy VMs may remain in an unhealthy state. The issue should not impact running VMs. The issue should occur only if a VM reaches an unhealthy state. The automation that would usually fix it may not be working.
Workaround: None at this time.
11:05 AM
Summary: Managed Instance Groups that rely on health checks to restart unhealthy VMs may remain in an unhealthy state
Description: We are experiencing an issue with Google Compute Engine beginning at Friday, 2023-04-21 02:00 US/Pacific. Our engineering team continues to investigate the issue. We will provide an update by Friday, 2023-04-21 04:15 US/Pacific with current details. We apologize to all who are affected by the disruption.
Diagnosis:
Customers impacted: any customer that uses Managed Instance Groups and relies on health checks to keep the instances in a healthy state.
Symptoms: Managed Instance Groups that rely on health checks to restart unhealthy VMs may remain in an unhealthy state. The issue should not impact running VMs. The issue should occur only if a VM reaches an unhealthy state. The automation that would usually fix it may not be working.
Workaround: None at this time.
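The symptom described in these updates (a VM becomes unhealthy and the usual automation does not repair it) could be spotted by inspecting a group's managed-instance listing. The following is a rough sketch, not an official diagnostic. It assumes the listing has already been parsed from JSON, and that each entry carries `instance`, `currentAction`, and `instanceHealth[].detailedHealthState` fields as in the Compute Engine managed-instance resource. The function name, the choice of states, and the sample data are hypothetical.

```python
from typing import Any, Dict, List

# Health states treated as "not healthy" in this sketch (an assumption).
UNHEALTHY_STATES = {"UNHEALTHY", "TIMEOUT"}


def stuck_unhealthy(managed_instances: List[Dict[str, Any]]) -> List[str]:
    """Return instances that look unhealthy while no repair action is running."""
    stuck = []
    for inst in managed_instances:
        states = {
            health.get("detailedHealthState")
            for health in inst.get("instanceHealth", [])
        }
        no_action = inst.get("currentAction", "NONE") == "NONE"
        if states & UNHEALTHY_STATES and no_action:
            stuck.append(inst.get("instance", "<unknown>"))
    return stuck


# Hypothetical sample data, shaped like a parsed managed-instance listing.
sample = [
    {"instance": "vm-a", "currentAction": "NONE",
     "instanceHealth": [{"detailedHealthState": "HEALTHY"}]},
    {"instance": "vm-b", "currentAction": "NONE",
     "instanceHealth": [{"detailedHealthState": "UNHEALTHY"}]},
    {"instance": "vm-c", "currentAction": "RECREATING",
     "instanceHealth": [{"detailedHealthState": "UNHEALTHY"}]},
]
print(stuck_unhealthy(sample))  # ['vm-b']
```

Only vm-b is reported: vm-a is healthy, and vm-c is unhealthy but already being recreated. During an incident like this one the reported health state itself may be unreliable, so a result from such a check should be read as a hint rather than a verdict.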