Google Cloud CRITICAL
Multiple cloud products are experiencing networking issues in us-central1
October 5, 2023 · 11:08 AM UTC – 06:55 PM UTC · Duration: 7h 47min
Affected Services
Batch, Virtual Private Cloud (VPC), Google Compute Engine, Google Kubernetes Engine, Google Cloud Dataflow, Google Cloud Networking, Google Cloud SQL, Cloud Filestore, Cloud Data Fusion, Google Cloud Dataproc
Timeline
03:30 PM
Incident Report
Summary
On 5 October 2023, multiple Google Cloud products experienced networking connectivity issues that impacted new and migrated VMs in the us-central1 region for a duration of 7 hours, 47 minutes. Existing VMs were not directly affected. We sincerely apologize for the impact caused to your business. We have identified the root cause and are taking immediate steps to prevent future failures.
Root Cause
The root cause of the issues was a management plane behavior change that had been rolling out slowly across Google Cloud. The aim of the change was to provide better decoupling in processing API updates to GCP Instance Groups and Network Endpoint Groups used as load balancer backends, thus providing better reliability and performance.
This change had been rolled out to several regions without incident. However, when it was deployed in us-central1, large workload sizes in the region triggered an unexpected memory increase for the control plane for virtual network routers. The controllers eventually ran out of memory, and although they were automatically restarted, the large workload size meant that they repeated the out-of-memory and restart sequence.
Virtual routers and their controllers are deployed into separate zonal failure domains. However, as the management plane change affected a regional API, this extended the issue to all virtual routers in the region, causing synchronized memory pressure and unavailability of controllers.
This unavailability of controllers prevented the virtual network routers from being updated with fresh state, such as new VMs, new locations of migrated VMs, dynamic routes, and health state of load balancer backends. As the frequency of out-of-memory events increased, delays in updating router state increased until there was no practical progress being made.
Existing VMs that did not migrate and did not change their health state were not affected directly. However, traffic to or from these VMs may have passed through a separate affected device such as a VPN Gateway, internal load balancer, or other VM.
There are separate sets of virtual routers for intra-region and cross-region traffic, each with their own control plane component. The cross-region routers were affected first and for a longer duration than the intra-region routers.
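The crash loop described above — each controller reloading the full regional state on restart, exhausting memory, and losing its in-flight work — can be illustrated with a toy simulation. Everything here (the numbers, the names, and the all-or-nothing batch model) is a hypothetical sketch, not Google's actual implementation:

```python
# Toy model of a controller that must sync N pending updates but is
# killed and restarted whenever its memory use exceeds a fixed limit.
# All constants and the batching model are illustrative assumptions.

MEMORY_LIMIT = 100      # arbitrary memory units before an OOM kill
COST_PER_UPDATE = 2     # memory consumed per update processed
BASELINE = 10           # memory used just to start up

def sync_progress(pending_updates: int, max_restarts: int = 5) -> int:
    """Return how many updates are durably applied before the
    controller gives up after max_restarts OOM kills."""
    applied = 0
    for _ in range(max_restarts):
        memory = BASELINE
        batch = 0
        # On each restart the controller reprocesses all remaining
        # state; with a large workload, memory runs out before the
        # batch can commit, so the in-flight work is lost.
        while batch < pending_updates - applied:
            memory += COST_PER_UPDATE
            if memory > MEMORY_LIMIT:
                batch = 0   # OOM: batch lost, controller restarts
                break
            batch += 1
        applied += batch
        if applied >= pending_updates:
            break
    return applied

print(sync_progress(30))    # small workload: all 30 updates applied
print(sync_progress(500))   # large workload: 0 — restarts make no progress
```

A small workload fits in the memory budget and completes, while a large one repeats the same out-of-memory and restart sequence on every attempt, mirroring the "no practical progress" behavior described above.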
Remediation and Prevention
Google engineers were alerted to slowness in the virtual network control plane in us-central1 on 04 October at 21:45 US/Pacific and immediately started investigations. Initial investigations revealed that slowness was intermittent. At 02:11 US/Pacific on 05 October alerts were received for failures in the virtual network router controllers due to memory exhaustion. Engineers immediately began an attempt to mitigate by allocating more memory. At 03:08 US/Pacific, our networking telemetry began to indicate cross-region packet loss to or from us-central1.
By 05:27 US/Pacific, the memory allocation change started to reach production. At 07:00 US/Pacific, the telemetry indicated intra-region packet loss primarily to and from us-central1-c, but it then subsided at 08:15 US/Pacific due to the rollout of the increased memory allocation.
At 08:22 US/Pacific, the increased memory usage was correlated with the rollout of the management plane change. At 08:52 US/Pacific, a rollback of the management plane change was started, completing in us-central1 at 09:35 US/Pacific. At this point, all out-of-memory events had stopped.
While impact had been greatly reduced, a small number of routers were not accepting updates and had to be manually restarted. These restarts did not cause any additional packet loss. By 10:55 US/Pacific all packet loss had stopped and the control plane was processing updates normally.
If your service or application was affected, we apologize — this is not the level of quality and reliability we strive to offer you. Google is committed to preventing a repeat of this issue in the future and is completing the following actions:
Proactive alerting of memory risks and unexpected increases.
Refactoring our deployment configuration to allow engineers to reallocate memory much more quickly.
Re-evaluating existing and establishing new practices and safety mechanisms for API reconciliation.
Increasing visibility of management plane changes across teams so that they can be correlated more quickly.
Adjusting our deployment footprint to reduce the chance of simultaneous regional memory exhaustion due to regional API changes.
Memory optimizations in the traffic routers and their controllers to prevent unnecessary overhead.
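The first of these actions — proactive alerting on memory risks and unexpected increases — could take roughly the following shape. This is a minimal sketch; the thresholds, function name, and sampling model are illustrative assumptions, not Google's monitoring stack:

```python
# Sketch of proactive memory alerting: flag both low absolute headroom
# and unexpectedly fast growth between consecutive samples.
# Thresholds and names are hypothetical.

def memory_alerts(samples, limit, headroom_pct=0.8, growth_pct=0.25):
    """Yield (index, reason) for samples that warrant an alert.

    samples: memory usage readings, in order
    limit: hard memory limit for the process
    headroom_pct: alert when usage exceeds this fraction of the limit
    growth_pct: alert when usage grows by more than this fraction
                relative to the previous sample
    """
    for i, usage in enumerate(samples):
        if usage > limit * headroom_pct:
            yield i, "low headroom"
        elif i > 0 and samples[i - 1] > 0 and \
                (usage - samples[i - 1]) / samples[i - 1] > growth_pct:
            yield i, "unexpected growth"

readings = [40, 42, 60, 85]   # a sudden jump, then close to the limit
print(list(memory_alerts(readings, limit=100)))
```

Alerting on the growth rate, not just the absolute level, is what makes the check "proactive": the jump from 42 to 60 fires before the process is anywhere near its limit.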
Detailed Description of Impact
On 5 October 2023 from 03:08 to 10:55 US/Pacific, multiple Google Cloud products experienced networking connectivity issues in us-central1. Newly created and recently migrated VMs experienced extended delays before networking became functional. This impacted higher-level workloads that rely on provisioning VMs.
Virtual Private Cloud:
Newly created VMs for some projects in us-central1 experienced extended delays before networking became functional.
Live migrated VMs for some projects in us-central1 experienced extended loss of connectivity after migrating.
Health check state of VMs in some projects in us-central1 were not being propagated to load balancers in a timely manner.
From 03:08 to 10:55 US/Pacific, impact was largely limited to cross-regional virtual network traffic. From 07:00 to 08:15 US/Pacific, there was a substantial impact on intra-region flows.
Google Kubernetes Engine:
Up to 0.4 percent of clusters in us-central1 may have experienced downtime and/or delays during cluster operations such as recreate and upgrade.
24 percent of cluster creation attempts experienced failure or delay. For the majority, no action was needed; the operations succeeded after a period of up to 120 minutes.
Note that affected operations may have been triggered automatically by Google, e.g. autoupgrade or node repair, as well as by customers.
During the downtime, in the Google Cloud Console Clusters page, customers may have seen that some Nodes had not registered and the cluster was unhealthy.
Cloud Data Fusion:
New Cloud Data Fusion VM creation may have failed in the us-central1 region. This issue impacted existing instance operations in the us-central1 region.
Around 27 percent of the total Data Fusion requests in us-central1 at the time encountered issues.
Cloud Filestore:
New Filestore VMs created during upgrades were unable to communicate with each other. Filestore upgrades may have failed and were rolled back to the previous version.
Before the incident started, only ~6 percent of instances in us-central1-a had been updated. This issue prevented the update of the rest.
The update backlog was processed gradually and completed ~1 day later, on 06 October.
Cloud SQL:
New Cloud SQL VM creations failed and Cloud SQL databases were unavailable if changes that resulted in a new VM being created were made to the existing instances. (e.g. Update, Self Service Maintenance, Clone/Failover, etc.)
Cloud Dataproc:
New cluster creations experienced elevated latencies and failures: up to 5 percent of new cluster creations failed in us-central1.
Existing clusters may have failed to execute jobs.
Cloud Dataflow:
Existing jobs experienced degradation in acquiring resources in response to demand (horizontal scaling failures).
New jobs faced start up failure or extended initialization latencies.
Container downloads from Artifact Registry failed (resulting in failure to instantiate or horizontally scale workloads).
The issue impacted less than 5 percent of all Dataflow jobs.
Cloud Datastream:
On 5 October from 08:00 to 10:10 US/Pacific, streams in the us-central1 region experienced delayed ingestion of data due to a high restart rate of the stream pods responsible for scheduling ingestion tasks.
Less than 5 percent of streams in us-central1 were impacted.
The impacted streams had a spike in the “Data freshness” and “Total latency” monitoring metrics in this timeframe.
02:24 AM
Mini Incident Report
We apologize for the inconvenience this service outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support .
(All Times US/Pacific)
Incident Start: 5 October 2023 03:00
Incident End: 5 October 2023 11:00
Duration: 8 hours
Affected Services and Features:
Virtual Private Cloud
Google Kubernetes Engine
Google Compute Engine
Google Cloud Dataproc
Google Cloud Dataflow
Cloud Data Fusion
Cloud Filestore
Cloud SQL
Batch
Regions/Zones: us-central1
Description:
Multiple Google Cloud products experienced networking connectivity issues which impacted VMs in the us-central1 region for a duration of 8 hours. From preliminary analysis, the issue was due to a recent rollout of the management plane which caused the control plane for some traffic routers to run out of memory. This caused the routing policy in the data plane to become stale.
The issue was mitigated by rolling back the management plane change that triggered the issue. The memory allocation for the affected control plane component was increased to prevent recurrence of the issue.
Google will complete a full Incident Report in the following days that will provide a detailed root cause.
Customer Impact:
Virtual Private Cloud:
Newly created VMs for some projects in us-central1 experienced extended delays before networking became functional.
Live migrated VMs for some projects in us-central1 experienced extended loss of connectivity after migrating.
Health check state of VMs in some projects in us-central1 were not being propagated to load balancers in a timely manner.
Cross-region traffic originating from or destined to us-central1 was more likely to be affected than intra-region traffic.
Google Kubernetes Engine:
Clusters may have experienced downtime after recreate, upgrade, or creation operations. No action was needed, as they eventually recovered on their own within 30 to 120 minutes. The downtime began once such an operation had been initiated. During the downtime, customers might have seen in the Google Cloud Console Clusters page that some Nodes had not registered and the cluster was unhealthy.
Cloud Data Fusion:
New Cloud Data Fusion VM creation may have failed in the us-central1 region. This issue impacted existing instance operations in the us-central1 region.
Cloud Filestore:
New Filestore VMs created during upgrades were unable to communicate with each other. Filestore upgrades may have failed and were rolled back to the previous version.
Cloud SQL:
New Cloud SQL VM creations failed and Cloud SQL databases were unavailable if changes that resulted in a new VM being created were made to the existing instances. (e.g. Update, SSM, Drain, Clone/Failover, etc.)
Cloud Dataproc:
New cluster creations experienced elevated latencies and failures.
Existing clusters may have failed to execute jobs.
Cloud Dataflow:
Existing jobs experienced degradation in acquiring resources in response to demand (horizontal scaling failures).
New jobs faced start up failure or extended initialization latencies.
Container downloads from Artifact Registry failed (resulting in failure to instantiate or horizontally scale workloads).
Any products or services reliant on VM creation may have observed impact for the duration of the incident. We are continuing to investigate and will provide further detail on additional impact in the full Incident Report.
07:29 PM
The issue with Batch, Cloud Data Fusion, Cloud Filestore, Google Cloud Dataflow, Google Cloud Dataproc, Google Cloud Networking, Google Cloud SQL, Google Compute Engine, Google Kubernetes Engine, Virtual Private Cloud (VPC) has been resolved for all affected projects as of Thursday, 2023-10-05 11:12 US/Pacific.
We thank you for your patience while we worked on resolving the issue.
07:13 PM
Summary: Multiple cloud products are experiencing networking issues in us-central1
Description: Our engineering team rolled out a mitigation, and our internal monitoring shows signs of recovery.
We are closely monitoring for full resolution. We do not have an ETA for full resolution at this point.
Cloud SQL impact was mitigated at 10:37 US/Pacific.
We will provide more information by Thursday, 2023-10-05 11:45 US/Pacific.
Diagnosis:
Networking Impact:
Newly created VMs in us-central1 may experience delays of a few minutes before the networking stack becomes fully operational.
Some customers are experiencing packet loss for cross-regional traffic coming into us-central1.
Some customers also experienced packet loss for VM network traffic in us-central1 between 07:00 US/Pacific and 08:15 US/Pacific on 2023-10-05.
GKE impact:
Clusters might experience downtime after recreate, upgrade, or creation operations. No action is needed, as they will eventually recover on their own within 30 to 120 minutes. The downtime begins once such an operation has been initiated. During the downtime, customers might see in the Google Cloud Console Clusters page that some Nodes have not registered and the cluster is unhealthy.
Cloud SQL:
New Cloud SQL Instance creations will fail and Cloud SQL databases may be unavailable if changes are made to the existing instances (Update, SSM, Drain, Clone/Failover, etc.)
Cloud Data Fusion:
New Cloud Data Fusion instance creation might fail in the us-central1 region. This issue may impact existing instance operations in the us-central1 region.
Workaround: Customers can use unaffected regions where feasible.
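The workaround of using unaffected regions can be automated with a simple fallback loop. The sketch below is hypothetical: `create_fn`, `RegionUnavailable`, and the helper itself are stand-ins for whatever provisioning call and error type your SDK exposes, not a real Google Cloud API:

```python
# Hypothetical helper for the "use unaffected regions" workaround:
# try a preferred region first, then fall back to alternates.

class RegionUnavailable(Exception):
    """Stand-in for the error your provisioning call raises when a
    region is unhealthy."""

def create_with_fallback(create_fn, regions):
    """Return (region, resource) from the first region where create_fn
    succeeds; raise if every region fails."""
    errors = {}
    for region in regions:
        try:
            return region, create_fn(region)
        except RegionUnavailable as exc:
            errors[region] = exc   # record the failure, try the next region
    raise RuntimeError(f"all regions failed: {errors}")

# Example with a fake provisioning call: us-central1 is down, so the
# helper lands in us-east1.
def fake_create(region):
    if region == "us-central1":
        raise RegionUnavailable("networking issues")
    return f"vm-in-{region}"

print(create_with_fallback(fake_create, ["us-central1", "us-east1"]))
```

Whether this is feasible depends on the workload: data-locality, latency, and quota constraints may rule out some alternate regions, which is why the advisory says "where feasible".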
06:58 PM
Summary: Multiple cloud products are experiencing networking issues in us-central1
Description: Our engineering team rolled out a mitigation, and our internal monitoring shows signs of recovery.
We are closely monitoring for full resolution. We do not have an ETA for full resolution at this point.
We will provide more information by Thursday, 2023-10-05 11:45 US/Pacific.
Diagnosis:
Networking Impact:
Newly created VMs in us-central1 may experience delays of a few minutes before the networking stack becomes fully operational.
Some customers are experiencing packet loss for cross-regional traffic coming into us-central1.
Some customers also experienced packet loss for VM network traffic in us-central1 between 07:00 US/Pacific and 08:15 US/Pacific on 2023-10-05.
GKE impact:
Clusters might experience downtime after recreate, upgrade, or creation operations. No action is needed, as they will eventually recover on their own within 30 to 120 minutes. The downtime begins once such an operation has been initiated.
Cloud SQL:
New Cloud SQL Instance creations will fail and Cloud SQL databases may be unavailable if changes are made to the existing instances (Update, SSM, Drain, Clone/Failover, etc.)
Workaround: Customers can use unaffected regions where feasible.
06:35 PM
Summary: Multiple cloud products are experiencing networking issues in us-central1
Description: Mitigation work is currently underway by our engineering team.
We do not have an ETA for mitigation at this point.
We will provide more information by Thursday, 2023-10-05 11:10 US/Pacific.
Diagnosis:
Networking Impact:
Newly created VMs in us-central1 may experience delays of a few minutes before the networking stack becomes fully operational.
Some customers are experiencing packet loss for cross-regional traffic coming into us-central1.
Some customers also experienced packet loss for VM network traffic in us-central1 between 07:00 US/Pacific and 08:15 US/Pacific on 2023-10-05.
GKE impact:
Clusters might experience downtime after a recreate. No action is needed, as they will eventually recover on their own within 30 to 60 minutes. The downtime begins once a recreate has been initiated.
Cloud SQL:
Cloud SQL databases may be unavailable if changes are made to the underlying VM (Update, Recreate, etc.)
Workaround: Customers can use unaffected regions where feasible.
06:21 PM
Summary: Multiple cloud products are experiencing networking issues in us-central1
Description: Mitigation work is currently underway by our engineering team.
We do not have an ETA for mitigation at this point.
We will provide more information by Thursday, 2023-10-05 11:00 US/Pacific.
Diagnosis:
Networking Impact:
Newly created VMs in us-central1 may experience delays of a few minutes before the networking stack becomes fully operational.
Some customers are experiencing packet loss for cross-regional traffic coming into us-central1.
Some customers also experienced packet loss for VM network traffic in us-central1 between 07:00 US/Pacific and 08:15 US/Pacific on 2023-10-05.
GKE impact:
Clusters might experience downtime after a recreate. No action is needed, as they will eventually recover on their own. The downtime begins once a recreate has been initiated.
Workaround: Customers can use unaffected regions where feasible.
05:43 PM
Summary: Google Virtual Private Cloud and Google Kubernetes Engine are experiencing issues in us-central1
Description: Mitigation work is currently underway by our engineering team.
We do not have an ETA for mitigation at this point.
We will provide more information by Thursday, 2023-10-05 10:45 US/Pacific.
Diagnosis:
Networking Impact:
Newly created VMs in us-central1 may experience delays of a few minutes before the networking stack becomes fully operational.
Some customers are experiencing packet loss for cross-regional traffic coming into us-central1.
Some customers also experienced packet loss for VM network traffic in us-central1 between 07:00 US/Pacific and 08:15 US/Pacific on 2023-10-05.
GKE impact:
Clusters might experience downtime after a recreate. No action is needed, as they will eventually recover on their own. The downtime begins once a recreate has been initiated.
Workaround: Customers can use unaffected regions where feasible.
05:15 PM
Summary: Google Virtual Private Cloud is experiencing network issues in us-central1
Description: Mitigation work is currently underway by our engineering team.
We do not have an ETA for mitigation at this point.
We will provide more information by Thursday, 2023-10-05 10:45 US/Pacific.
Diagnosis:
Newly created VMs in us-central1 may experience delays of a few minutes before the networking stack becomes fully operational.
Some customers are experiencing packet loss for cross-regional traffic coming into us-central1.
Some customers also experienced packet loss for VM network traffic in us-central1 between 07:00 US/Pacific and 08:15 US/Pacific on 2023-10-05.
Workaround: Customers can use unaffected regions where feasible.
04:59 PM
Summary: Google Virtual Private Cloud is experiencing network issues in us-central1
Description: Mitigation work is currently underway by our engineering team.
We do not have an ETA for mitigation at this point.
We will provide more information by Thursday, 2023-10-05 10:30 US/Pacific.
Diagnosis: Newly created VMs in us-central1 may experience delays of a few minutes before the networking stack becomes fully operational.
Some customers also experienced packet loss for VM network traffic in us-central1 between 07:00 US/Pacific and 08:15 US/Pacific on 2023-10-05.
Workaround: Customers can use unaffected regions where feasible.
04:35 PM
Summary: Connectivity issue impacting VMs in us-central1
Description: Mitigation work is currently underway by our engineering team.
We do not have an ETA for mitigation at this point.
We will provide more information by Thursday, 2023-10-05 10:00 US/Pacific.
Diagnosis: Users may experience packet loss in region: us-central1
Workaround: None at this time
03:37 PM
Summary: Connectivity issue impacting newly created VMs in us-central1
Description: Mitigation work is currently underway by our engineering team.
We do not have an ETA for mitigation at this point.
We will provide more information by Thursday, 2023-10-05 09:00 US/Pacific.
Diagnosis: Newly created VMs may experience delays of a few minutes before the networking stack becomes fully operational. Existing VMs should continue working as before.
Workaround: Deploy VMs into unaffected zones.
03:21 PM
Summary: Connectivity issue impacting newly created VMs in us-central1
Description: We are experiencing an issue with Google Cloud Networking beginning on Thursday, 2023-10-05 05:14 US/Pacific.
Our engineering team continues to investigate the issue.
We will provide an update by Thursday, 2023-10-05 07:50 US/Pacific with current details.
We apologize to all who are affected by the disruption.
Diagnosis: Newly created VMs may experience delays of a few minutes before the networking stack becomes fully operational. Existing VMs should continue working as before.
Workaround: Deploy VMs into unaffected zones.