Google Cloud MINOR

Cloud AI Platform and Vertex AI Training elevated error rates for GPU jobs in us-central1, us-east1, and europe-west3

March 4, 2023, 05:56 AM UTC – March 5, 2023, 06:40 AM UTC · Duration: 24h 44min

Affected Services

Vertex AI Training, Cloud Machine Learning

Timeline

06:40 AM
The issue with Vertex AI Training has been resolved for all affected users as of Saturday, 2023-03-04 22:39 US/Pacific. We thank you for your patience while we worked on resolving the issue.
06:57 PM
Summary: Cloud AI Platform and Vertex AI Training elevated error rates for GPU jobs in us-central1, us-east1, and europe-west3
Description: Mitigation work is currently underway by our engineering team. At this time, we believe the issue has been resolved for the us-central1 region. We are working to confirm this and are also working on mitigation for us-east1 and europe-west3. We do not have an ETA for mitigation in us-east1 and europe-west3 at this point. We will provide more information by Sunday, 2023-03-05 14:00 US/Pacific.
Diagnosis: Cloud AI Platform and Vertex AI Training GPU jobs may experience elevated failure rates in us-central1, us-east1, and europe-west3.
Workaround: None at this time.
07:00 AM
Summary: Cloud AI Platform and Vertex AI Training elevated error rates for GPU jobs in us-central1, us-east1, and europe-west3
Description: Mitigation work is currently underway by our engineering team. At this time, we believe the issue has been resolved for the us-central1 region. We are working to confirm this and are also working on mitigation for us-east1 and europe-west3. We do not have an ETA for mitigation in us-east1 and europe-west3 at this point. We will provide more information by Saturday, 2023-03-04 11:00 US/Pacific.
Diagnosis: Cloud AI Platform and Vertex AI Training GPU jobs may experience elevated failure rates in us-central1, us-east1, and europe-west3.
Workaround: None at this time.
06:20 AM
Summary: Cloud AI Platform and Vertex AI Training elevated error rates for GPU jobs in us-central1, us-east1, and europe-west3
Description: Mitigation work is currently underway by our engineering team. At this time, we believe the issue has been resolved for the us-central1 region and are working to confirm. We do not have an ETA for mitigation in us-east1 and europe-west3 at this point. We will provide more information by Friday, 2023-03-03 23:30 US/Pacific.
Diagnosis: Cloud AI Platform and Vertex AI Training GPU jobs may experience elevated failure rates in us-central1, us-east1, and europe-west3.
Workaround: None at this time.