Google Cloud MINOR
Google Cloud Storage (GCS) in europe-west1 is experiencing unavailability errors
November 10, 2022 · 08:04 AM UTC – 04:07 PM UTC · Duration: 8h 3min
Affected Services
Google Cloud Storage
Timeline
08:37 PM
Incident Report
Summary
Starting on 10 November 2022 at 00:04 PST, customers of Google Cloud Storage (GCS) and Google BigQuery may have seen intermittent error messages while using these services in europe-west1. The disruption lasted 8 hours and 3 minutes.
To our GCS and BigQuery customers who were impacted during this disruption, we sincerely apologize. This is not the level of quality and reliability we strive to offer you. We have conducted an internal investigation and are taking immediate steps to improve the platform’s performance and availability.
Root Cause
This issue was caused by a recent rollout that was intended to improve maintenance, efficiency, and supportability by sharing internal data requests to new jobs. However, due to an issue in the rollout, the migration of the data resulted in transient faults and caused a failure rate of up to 7% of read and 19% of write traffic in europe-west1.
Remediation and Prevention
Google engineers were alerted to this issue and immediately started to investigate.
Engineers identified the problematic rollout described above; however, the first two attempts to roll back this change did not resolve the issue. Google engineers then identified a quicker, more effective direct mitigation. This change took a couple of minutes to complete and fully mitigated the impact.
Google is committed to preventing a repeat of this issue in the future and is taking the following actions:
We have postponed the internal data migration rollout until all critical preventative actions are completed.
Going forward, we will pause binary rollouts and stop retry updates after Google engineers get alerted to paging events.
We will implement a two-stage canary deployment for rollouts to reduce the percentage of tasks impacted by a catastrophic error.
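The two rollout safeguards listed above (pausing on paging events and staging the change through canaries) can be illustrated with a minimal sketch. This is not Google's rollout tooling: the function name `staged_rollout`, the callbacks, the stage fractions, and the 1% error threshold are all hypothetical, chosen only to show the control flow.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[float],
    deploy_to: Callable[[float], None],
    observed_error_rate: Callable[[], float],
    page_active: Callable[[], bool],
    max_error_rate: float = 0.01,  # hypothetical threshold, not from the report
) -> float:
    """Advance through rollout stages and return the last fraction deployed.

    Each stage is a cumulative fraction of tasks (for example 0.01, 0.10, 1.0).
    The rollout stops before a stage if a page is active, and stops after a
    stage if the observed error rate exceeds the threshold.
    """
    reached = 0.0
    for fraction in stages:
        if page_active():
            break  # pause: engineers have been paged, do not push further
        deploy_to(fraction)
        reached = fraction
        if observed_error_rate() > max_error_rate:
            break  # halt: later stages never receive the change
    return reached
```

With stages such as `[0.01, 0.10, 1.0]`, a fault detected in the second canary stage limits exposure to 10% of tasks. Rolling the change back is deliberately left out of this sketch.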
Google is committed to quickly and continually improving our technology and operations to prevent service disruptions. We appreciate your patience and apologize again for the impact to your organization. We thank you for your business.
Detailed Description of Impact
On Thursday, 10 November 2022 from 00:04 US/Pacific to 08:07 US/Pacific, 7% of read and 19% of write traffic in the europe-west1 region was failing. All of the impact was limited to data operations, including read, write, rewrite, clone, compose, and upload of objects. Customers in the europe-west1 region may have experienced the following symptoms during this period:
Affected GCS customers may have received HTTP 503 errors for read/write operations in europe-west1. Metadata operations such as object listing continued to work successfully.
Affected customers of Google BigQuery may have received “INTERNAL_ERROR” when running import jobs in europe-west1 during the impact window.
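Because the errors were intermittent HTTP 503 responses, a client that retries with exponential backoff may ride through some of the failures. The report itself lists no workaround, so the following is only a generic client-side sketch. `call_with_backoff` and its parameters are hypothetical names, and `op` stands in for any request that returns a status code and a body; no real GCS client API is used.

```python
import random
import time
from typing import Callable, Tuple

RETRYABLE_STATUS = {503}  # the status code named in the report


def call_with_backoff(
    op: Callable[[], Tuple[int, bytes]],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bytes]:
    """Call op() until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = op()
        if status not in RETRYABLE_STATUS:
            return status, body
        if attempt == max_attempts - 1:
            break  # out of attempts: return the last 503 to the caller
        # Full jitter: sleep a random time up to the capped exponential delay.
        delay = min(max_delay, base_delay * (2 ** attempt))
        sleep(random.uniform(0, delay))
    return status, body
```

Retrying a write is only safe when the operation is idempotent, so callers would need to decide per operation whether this pattern applies.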
10:42 PM
Mini Incident Report
We apologize for the inconvenience this service disruption may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support or to Google Workspace Support using help article https://support.google.com/a/answer/1047213.
(All Times US/Pacific)
Incident Start: 10 November 2022 00:04
Incident End: 10 November 2022 08:07
Duration: 8 hours, 3 minutes
Affected Services and Features:
Google Cloud Storage
Google BigQuery
Regions/Zones: europe-west1
Description:
Google Cloud Storage experienced intermittent unavailability errors for a period of 8 hours and 3 minutes in europe-west1. From a preliminary analysis, the root cause of the issue was related to a recent change to network traffic routing. This change was rolled back to successfully mitigate the issue. Google will publish a full Incident Report with additional root cause information.
Customer Impact:
Google Cloud Storage customers would have received HTTP 503 errors for read/write operations in europe-west1. Metadata operations such as object listing continued to work successfully.
Google BigQuery customers may have received “INTERNAL_ERROR” when running import jobs in europe-west1 during the impact window.
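The incident window is stated in US/Pacific here and in UTC in the page header. A short check, assuming the fixed UTC-8 offset of Pacific Standard Time that applied on 10 November 2022, confirms the two agree and that the duration is 8 hours and 3 minutes:

```python
from datetime import datetime, timedelta, timezone

# Assumption: PST (UTC-8) was in effect on this date; a fixed offset avoids
# depending on a time-zone database.
PST = timezone(timedelta(hours=-8), "PST")

start = datetime(2022, 11, 10, 0, 4, tzinfo=PST)
end = datetime(2022, 11, 10, 8, 7, tzinfo=PST)

duration = end - start
start_utc = start.astimezone(timezone.utc)
end_utc = end.astimezone(timezone.utc)

print(duration)                     # 8:03:00
print(start_utc.strftime("%H:%M"))  # 08:04
print(end_utc.strftime("%H:%M"))    # 16:07
```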
04:41 PM
The issue with Google Cloud Storage has been resolved for all affected users as of Thursday, 2022-11-10 08:07 US/Pacific.
The mitigation applied by our engineering team worked as expected.
We thank you for your patience while we worked on resolving the issue.
04:27 PM
Summary: Google Cloud Storage (GCS) in europe-west1 is experiencing unavailability errors
Description: We are experiencing an intermittent issue with Google Cloud Storage beginning on Thursday, 2022-11-10 00:04:43 PST US/Pacific.
This issue is suspected to be caused by a recently rolled out update. The engineering team is rolling back the update, and current data indicates that the rollback is effective in mitigating this issue.
The mitigation is expected to be completed by Thursday, 2022-11-10 08:40 US/Pacific.
We will provide more information by Thursday, 2022-11-10 09:00 US/Pacific.
Diagnosis: GCS users will experience 503 errors for many operations
Workaround: None at this time
03:17 PM
Summary: Google Cloud Storage (GCS) in europe-west1 is experiencing unavailability errors
Description: We are experiencing an intermittent issue with Google Cloud Storage beginning on Thursday, 2022-11-10 00:04:43 PST US/Pacific.
Mitigation work is currently underway by our engineering team.
We do not have an ETA for mitigation at this point.
We will provide more information by Thursday, 2022-11-10 08:35 US/Pacific.
Diagnosis: GCS users will experience 503 errors for many operations
Workaround: None at this time
02:41 PM
Summary: Google Cloud Storage (GCS) in europe-west1 is experiencing unavailability errors
Description: We are experiencing an intermittent issue with Google Cloud Storage beginning on Thursday, 2022-11-10 00:04:43 PST US/Pacific.
Our engineering team continues to investigate the issue.
We will provide an update by Thursday, 2022-11-10 07:20 US/Pacific with current details.
We apologize to all who are affected by the disruption.
Diagnosis: GCS users will experience 503 errors for many operations
Workaround: None at this time
02:14 PM
Summary: Google Cloud Storage (GCS) in europe-west1 is experiencing unavailability errors
Description: We are experiencing an intermittent issue with Google Cloud Storage beginning on Thursday, 2022-11-10 05:20:04 PST US/Pacific.
Our engineering team continues to investigate the issue.
We will provide an update by Thursday, 2022-11-10 06:45 US/Pacific with current details.
We apologize to all who are affected by the disruption.
Diagnosis: GCS users will experience errors for many operations
Workaround: None at this time
02:08 PM
Summary: Google Cloud Storage (GCS) in europe-west1 is experiencing unavailability errors
Description: We are experiencing an intermittent issue with Google Cloud Storage beginning on Thursday, 2022-11-10 05:20:04 PST US/Pacific.
Our engineering team continues to investigate the issue.
We will provide an update by Thursday, 2022-11-10 06:45 US/Pacific with current details.
We apologize to all who are affected by the disruption.
Diagnosis: GCS users will experience errors for many operations
Workaround: None at this time