Google Cloud MINOR
GPU device plugin component version 0.1.11-gke.1 causing missing GPU features in 1.23 and 1.24
July 11, 2023 11:05 PM UTC – July 13, 2023 02:47 AM UTC · Duration: 27h 42min
Affected Services
Google Kubernetes Engine
Timeline
02:47 AM
The issue with Google Kubernetes Engine has been resolved for all affected projects as of Wednesday, 2023-07-12 17:51 US/Pacific.
We thank you for your patience while we worked on resolving the issue.
01:03 AM
Summary: GPU device plugin component version 0.1.11-gke.1 causing missing GPU features in 1.23 and 1.24
Description: Our engineering team is currently working on mitigation.
We have identified a mitigation and started implementing it. The mitigation is expected to be complete by Monday, 2023-07-17. We will continue to provide updates on any status changes. A permanent fix is expected to be released by 2023-07-19.
We will provide more information by Monday, 2023-07-17 13:00 US/Pacific.
Diagnosis: Impacted customers may not be able to use the GPU time-sharing feature.
Workaround: Upgrade the cluster to version 1.25 or 1.26.
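The condition described in this update can be sketched as a small check. This is a minimal illustration, not an official tool: the function names are hypothetical, the sample version strings in the usage lines are made up for illustration, and it assumes you have already obtained the cluster version and the GPU device plugin component version by your own means.

```python
import re
from typing import Tuple

# Values taken from this incident report.
AFFECTED_MINORS = {(1, 23), (1, 24)}
AFFECTED_PLUGIN_VERSION = "0.1.11-gke.1"
WORKAROUND_MINORS = [(1, 25), (1, 26)]


def parse_minor(cluster_version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '1.24.x-gke.y'."""
    match = re.match(r"^v?(\d+)\.(\d+)", cluster_version.strip())
    if match is None:
        raise ValueError(f"unrecognized cluster version: {cluster_version!r}")
    return int(match.group(1)), int(match.group(2))


def is_affected(cluster_version: str, plugin_version: str) -> bool:
    """Return True when both conditions named in the report hold:
    the cluster is on 1.23 or 1.24 AND the plugin is 0.1.11-gke.1."""
    on_affected_minor = parse_minor(cluster_version) in AFFECTED_MINORS
    on_affected_plugin = plugin_version.strip() == AFFECTED_PLUGIN_VERSION
    return on_affected_minor and on_affected_plugin


def workaround_targets() -> str:
    """Format the upgrade targets listed in the workaround."""
    return " or ".join(f"{major}.{minor}" for major, minor in WORKAROUND_MINORS)


# Illustrative inputs only; these patch and build numbers are invented.
if is_affected("1.24.12-gke.500", "0.1.11-gke.1"):
    print(f"Affected: consider upgrading the cluster to {workaround_targets()}")
```

The check requires both conditions together, since the report limits the impact to 1.23 and 1.24 clusters that use component version 0.1.11-gke.1.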
11:31 PM
Summary: GPU device plugin component version 0.1.11-gke.1 causing missing GPU features in 1.23 and 1.24
Description: We are experiencing an issue with Google Kubernetes Engine.
The following GPU features are missing on clusters running versions 1.23 and 1.24 with component version 0.1.11-gke.1:
Prometheus metrics library version upgrade
health checker for multi-instance GPU
nvidia-modeset device configuration
GPU time-sharing
Our engineering team continues to investigate the issue.
We will provide an update by Tuesday, 2023-07-11 17:35 US/Pacific with current details.
We apologize to all who are affected by the disruption.
Diagnosis: None at this time.
Workaround: None at this time.
11:05 PM
Summary: GPU device plugin component version 0.1.11-gke.1 causing missing GPU features in 1.23 and 1.24
Description: We are experiencing an issue with Google Kubernetes Engine.
The following GPU features are missing on clusters running versions 1.23 and 1.24 with component version 0.1.11-gke.1:
Prometheus metrics library version upgrade
health checker for multi-instance GPU
nvidia-modeset device configuration
GPU time-sharing
Our engineering team continues to investigate the issue.
We will provide an update by Tuesday, 2023-07-11 15:35 US/Pacific with current details.
We apologize to all who are affected by the disruption.
Diagnosis: None at this time.
Workaround: None at this time.