Google Cloud MAJOR
Large Vertex Model Garden Deployment may experience failures.
October 22, 2024 · 10:08 PM UTC – November 5, 2024 · 04:30 AM UTC · Duration: 318h 22min
Affected Services
Cloud Machine Learning, Vertex AI Online Prediction
Timeline
07:31 PM
Mini Incident Report
We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support
(All Times US/Pacific)
Incident Start: 22 October 2024, 14:08
Incident End: 4 November 2024, 20:30
Duration: 13 days, 6 hours, 22 minutes
Affected Services and Features:
Vertex AI Online Prediction (Vertex Model Garden Deployments)
Regions/Zones: All regions except asia-southeast1, europe-west4, us-central1, us-east1, us-east4
Description:
Deployments of large models (those that require more than 100 GB of disk) in Vertex AI Online Prediction (Vertex Model Garden Deployments) failed in most regions for a duration of up to 13 days, 6 hours, 22 minutes starting on Tuesday, 22 October 2024 at 14:08 US/Pacific.
From preliminary analysis, the root cause of the issue is an internal storage provisioning configuration error that was introduced as part of a recent change.
Google engineers mitigated the impact by rolling back the configuration change that caused the issue.
Customer Impact:
Customers would have received errors stating “Model server never became ready” while performing deployments during the period of impact.
Additional details:
As a workaround, customers were able to deploy in one of the non-impacted regions noted above.
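The workaround above can be sketched as a small region-selection check. This is an illustrative sketch only, not Google tooling: the region list and the 100 GB threshold come from this report, while the helper names `was_impacted` and `pick_deploy_region`, the example region `us-west1`, and the choice of the first listed region as the fallback are assumptions made for the example.

```python
# Regions the report lists as NOT impacted (from "Regions/Zones" above).
UNAFFECTED_REGIONS = (
    "asia-southeast1",
    "europe-west4",
    "us-central1",
    "us-east1",
    "us-east4",
)

# Disk-size threshold above which the report says deployments failed.
LARGE_MODEL_DISK_GB = 100


def was_impacted(region: str, disk_gb: float) -> bool:
    """Return True if a deployment matched the report's impact criteria."""
    return disk_gb > LARGE_MODEL_DISK_GB and region not in UNAFFECTED_REGIONS


def pick_deploy_region(preferred: str, disk_gb: float) -> str:
    """Hypothetical helper: keep the preferred region unless it was impacted."""
    if not was_impacted(preferred, disk_gb):
        return preferred
    # Assumption: fall back to the first listed unaffected region. A real
    # choice would also weigh latency, quota, and data-residency needs.
    return UNAFFECTED_REGIONS[0]


print(pick_deploy_region("us-west1", 150))      # large model, impacted region
print(pick_deploy_region("us-west1", 40))       # small model, no fallback needed
print(pick_deploy_region("europe-west4", 150))  # large model, unaffected region
```

Small models and deployments already targeting an unaffected region keep their preferred region; only large deployments in impacted regions are redirected.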
04:52 AM
The issue with Vertex AI Online Prediction (Large Vertex Model Garden deployment failure) has been resolved for all affected users as of Monday, 2024-11-04 20:28 US/Pacific.
We thank you for your patience while we worked on resolving the issue.
02:52 AM
Summary: Large Vertex Model Garden Deployment may experience failures.
Description: Our engineering team is continuing to work on mitigating the issue.
We do not have an ETA for mitigation at this point.
We will provide more information by Monday, 2024-11-04 21:30 US/Pacific.
Diagnosis: Customers may experience failures with large Vertex Model Garden deployments when using L4 Graphics Processing Units (GPUs).
Workaround: None at this time.
09:46 PM
Summary: Large Vertex Model Garden Deployment may experience failures.
Description: Mitigation work is currently underway by our engineering team.
We do not have an ETA for mitigation at this point.
We will provide more information by Monday, 2024-11-04 19:00 US/Pacific.
Diagnosis: Customers may experience failures with large Vertex Model Garden deployments when using L4 Graphics Processing Units (GPUs).
Workaround: None at this time.
09:17 PM
Summary: Large Vertex Model Garden Deployment Failures
Description: We are experiencing an issue with Vertex Model Garden Deployments.
Our engineering team continues to investigate the issue.
We will provide an update by Monday, 2024-11-04 14:00 US/Pacific with current details.
We apologize to all who are affected by the disruption.
Diagnosis: Customers may experience failures with large Vertex Model Garden deployments (greater than 100GB) when deployed on a GKE Autopilot cluster.
Workaround: None at this time.