Google Cloud MINOR
Cloud AI Platform and Vertex AI Training elevated error rates for GPU jobs in us-central1, us-east1, and europe-west3
March 4, 2023 · 05:56 AM UTC – March 5, 2023 · 06:40 AM UTC · Duration: 24h 44min
Affected Services
Vertex AI Training, Cloud Machine Learning
Timeline
06:40 AM
The issue with Vertex AI Training has been resolved for all affected users as of Saturday, 2023-03-04 22:39 US/Pacific.
We thank you for your patience while we worked on resolving the issue.
06:57 PM
Summary: Cloud AI Platform and Vertex AI Training elevated error rates for GPU jobs in us-central1, us-east1, and europe-west3
Description: Mitigation work is currently underway by our engineering team.
At this time, we believe the issue has been resolved for the us-central1 region. We are working to confirm this and are also working on mitigation for us-east1 and europe-west3.
We do not have an ETA for mitigation in us-east1 and europe-west3 at this point.
We will provide more information by Sunday, 2023-03-05 14:00 US/Pacific.
Diagnosis: Cloud AI Platform and Vertex AI Training GPU jobs may experience elevated failure rates in us-central1, us-east1, and europe-west3.
Workaround: None at this time.
07:00 AM
Summary: Cloud AI Platform and Vertex AI Training elevated error rates for GPU jobs in us-central1, us-east1, and europe-west3
Description: Mitigation work is currently underway by our engineering team.
At this time, we believe the issue has been resolved for the us-central1 region. We are working to confirm this and are also working on mitigation for us-east1 and europe-west3.
We do not have an ETA for mitigation in us-east1 and europe-west3 at this point.
We will provide more information by Saturday, 2023-03-04 11:00 US/Pacific.
Diagnosis: Cloud AI Platform and Vertex AI Training GPU jobs may experience elevated failure rates in us-central1, us-east1, and europe-west3.
Workaround: None at this time.
06:20 AM
Summary: Cloud AI Platform and Vertex AI Training elevated error rates for GPU jobs in us-central1, us-east1, and europe-west3
Description: Mitigation work is currently underway by our engineering team.
At this time, we believe the issue has been resolved for the us-central1 region and are working to confirm this.
We do not have an ETA for mitigation in us-east1 and europe-west3 at this point.
We will provide more information by Friday, 2023-03-03 23:30 US/Pacific.
Diagnosis: Cloud AI Platform and Vertex AI Training GPU jobs may experience elevated failure rates in us-central1, us-east1, and europe-west3.
Workaround: None at this time.