Google Cloud CRITICAL
Global: Vertex AI Online Prediction Is Experiencing Increased Error Rates
June 2, 2022 · 06:10 PM UTC – 10:30 PM UTC · Duration: 4h 20min
Affected Services
Vertex AI Online Prediction
Cloud Machine Learning
Timeline
09:06 PM
Mini Incident Report
We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Support by opening a case at https://cloud.google.com/support or by following the help article at https://support.google.com/a/answer/1047213.
(All Times US/Pacific)
Incident Start: 02 June 2022 10:10 US/Pacific
Incident End: 02 June 2022 14:30 US/Pacific
Duration: 4 hours, 20 minutes
Affected Services and Features:
Vertex AI Online Prediction
Regions/Zones: Global
Description:
Vertex AI Online Prediction experienced increased error rates ranging from 30% to 100% per region, depending on user usage patterns, for a duration of 4 hours, 20 minutes. From preliminary analysis, the root cause of the issue was that Vertex Prediction Endpoints were globally marked as deleted due to a faulty resource cleanup process. The service fully recovered when the Vertex Prediction Endpoints were restored.
Customer Impact:
Affected customers may have experienced:
All pre-existing Vertex models being undeployed from Vertex AI Endpoints
Empty responses when listing the deployed models
Runtime exceptions and general errors on Predict and Explain requests
Quota failure when trying to re-deploy models
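The two most visible symptoms listed above (an empty deployed-model list and failing Predict requests) can be combined into a simple check. The following is a minimal sketch under stated assumptions, not an official diagnostic: the function name `matches_incident_symptoms` and both of its parameters are hypothetical, and the caller is assumed to supply the deployed-model list and a callable that issues one Predict request.

```python
from typing import Callable, Sequence


def matches_incident_symptoms(
    deployed_models: Sequence[object],
    predict: Callable[[], object],
) -> bool:
    """Return True when an endpoint shows both symptoms described above.

    deployed_models: result of listing the models deployed on an endpoint
                     (hypothetical input, supplied by the caller).
    predict:         zero-argument callable that sends one Predict request
                     and raises on failure (hypothetical input).
    """
    # Symptom 1: the deployed-model list is empty.
    if len(deployed_models) > 0:
        return False
    # Symptom 2: a Predict request fails with an error.
    try:
        predict()
    except Exception:
        return True
    return False
```

With the `google-cloud-aiplatform` Python SDK, the inputs could plausibly come from `Endpoint.list_models()` and a lambda wrapping `Endpoint.predict(instances=...)`; that wiring is an assumption and is left to the caller so the check itself stays independent of any client library.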
11:24 PM
The issue with Vertex AI Online Prediction has been resolved for all affected users as of Thursday, 2022-06-02 15:21 US/Pacific.
We will publish an analysis of this incident once we have completed our internal investigation.
We thank you for your patience while we worked on resolving the issue.
10:51 PM
Summary: Global: Vertex AI Online Prediction Is Experiencing Increased Error Rates
Description: We believe the issue with Vertex AI Online Prediction is partially resolved.
We do not have an ETA for full resolution at this point.
We will provide an update by Thursday, 2022-06-02 16:10 US/Pacific with current details.
Diagnosis: For affected customers: When listing the deployed models in Endpoints, the list will be empty, and Predict and Explain requests will fail.
Workaround: None at this time.
10:50 PM
Summary: Global: Vertex AI Online Prediction Is Experiencing Increased Error Rates
Description: Mitigation work is currently underway by our engineering team.
The mitigation is expected to complete by Thursday, 2022-06-02 15:07 US/Pacific.
We will provide more information by Thursday, 2022-06-02 15:07 US/Pacific.
Diagnosis: For affected customers: When listing the deployed models in Endpoints, the list will be empty, and Predict and Explain requests will fail.
Workaround: None at this time.
10:06 PM
Summary: Global: Vertex AI Online Prediction Is Experiencing Increased Error Rates
Description: Mitigation work is currently underway by our engineering team.
The mitigation is expected to complete by Thursday, 2022-06-02 15:07 US/Pacific.
We will provide more information by Thursday, 2022-06-02 15:07 US/Pacific.
Diagnosis: Customers will experience increased error rates.
Workaround: None at this time.
09:58 PM
Summary: Global: Vertex AI Online Prediction Is Experiencing Increased Error Rates
Description: We are experiencing an issue with Vertex AI Online Prediction beginning at Thursday, 2022-06-02 10:20 US/Pacific.
Our engineering team continues to investigate the issue.
We will provide an update by Thursday, 2022-06-02 14:30 US/Pacific with current details.
We apologize to all who are affected by the disruption.
Diagnosis: Customers will experience errors.
Workaround: None at this time.