Google Cloud MINOR
Vertex AI custom training jobs failing if using more than 2GB ephemeral storage
August 16, 2024 · 07:44 PM UTC – 12:23 AM UTC · Duration: 4h 39min
Affected Services
Cloud Machine Learning · Vertex AI Training
Timeline
12:23 AM
The issue with Vertex AI Training has been resolved for all affected users as of Friday, 2024-08-16 16:07 US/Pacific.
We thank you for your patience while we worked on resolving the issue.
Thank you for choosing us.
08:03 PM
Summary: Vertex AI custom training jobs failing if using more than 2GB ephemeral storage
Description: Mitigation work is currently underway by our engineering team.
We do not have an ETA for mitigation at this point.
We will provide more information by Friday, 2024-08-16 17:30 US/Pacific.
Diagnosis: Custom Vertex AI training jobs running on GKE and using more than 2GB of ephemeral storage may fail with the error "Pod ephemeral local storage usage exceeds the total limit of containers 2Gi."
Workaround: None at this time.
07:58 PM
Summary: Vertex AI custom training jobs failing if using more than 2GB ephemeral storage
Description: Mitigation work is currently underway by our engineering team.
We do not have an ETA for mitigation at this point.
We will provide more information by Friday, 2024-08-16 17:00 US/Pacific.
Diagnosis: Custom Vertex AI training jobs running on GKE and using more than 2GB of ephemeral storage may fail with the error "Pod ephemeral local storage usage exceeds the total limit of containers 2Gi."
Workaround: None at this time.
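
To check whether a failed job matches this incident, one option is to search the job's log lines for the error text quoted in the Diagnosis. The following is a minimal sketch, not an official tool: the function name parse_ephemeral_limit is hypothetical, and the pattern assumes the message appears in logs exactly as quoted above, with the limit written as an integer and a binary unit suffix.

```python
import re

# Pattern built from the error text quoted in the Diagnosis.
# Assumption: the limit is an integer followed by a binary unit suffix.
EPHEMERAL_LIMIT_RE = re.compile(
    r"Pod ephemeral local storage usage exceeds the total limit of containers "
    r"(\d+)(Ki|Mi|Gi|Ti)"
)

# Bytes per binary unit suffix.
_UNIT_BYTES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_ephemeral_limit(log_line):
    """Return the reported limit in bytes if the line carries the error, else None.

    Hypothetical helper for illustration only.
    """
    match = EPHEMERAL_LIMIT_RE.search(log_line)
    if match is None:
        return None
    return int(match.group(1)) * _UNIT_BYTES[match.group(2)]


if __name__ == "__main__":
    sample = (
        "Pod ephemeral local storage usage exceeds the total limit "
        "of containers 2Gi."
    )
    limit = parse_ephemeral_limit(sample)
    if limit is not None:
        print(f"Job hit the ephemeral storage limit: {limit} bytes")
```

A line that does not contain the error returns None, so the helper can be applied to every line of an exported job log to flag affected runs.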