Google Cloud MINOR

Vertex AI custom training jobs failing if using more than 2GB ephemeral storage

August 16, 2024 · 07:44 PM UTC – 12:23 AM UTC · Duration: 4h 39min

Affected Services

Cloud Machine Learning, Vertex AI Training

Timeline

12:23 AM
The issue with Vertex AI Training has been resolved for all affected users as of Friday, 2024-08-16 16:07 US/Pacific. We thank you for your patience while we worked on resolving the issue. Thank you for choosing us.
08:03 PM
Summary: Vertex AI custom training jobs failing if using more than 2GB ephemeral storage
Description: Mitigation work is currently underway by our engineering team. We do not have an ETA for mitigation at this point. We will provide more information by Friday, 2024-08-16 17:30 US/Pacific.
Diagnosis: Custom Vertex AI training jobs running on GKE and using more than 2GB of ephemeral storage may fail with the error "Pod ephemeral local storage usage exceeds the total limit of containers 2Gi."
Workaround: None at this time.
07:58 PM
Summary: Vertex AI custom training jobs failing if using more than 2GB ephemeral storage
Description: Mitigation work is currently underway by our engineering team. We do not have an ETA for mitigation at this point. We will provide more information by Friday, 2024-08-16 17:00 US/Pacific.
Diagnosis: Custom Vertex AI training jobs running on GKE and using more than 2GB of ephemeral storage may fail with the error "Pod ephemeral local storage usage exceeds the total limit of containers 2Gi."
Workaround: None at this time.