Google Cloud MINOR

GPU device plugin component version 0.1.11-gke.1 causing missing GPU features in 1.23 and 1.24

July 11, 2023 · 11:05 PM UTC – 02:47 AM UTC · Duration: 27h 42min

Affected Services

Google Kubernetes Engine

Timeline

02:47 AM
The issue with Google Kubernetes Engine has been resolved for all affected projects as of Wednesday, 2023-07-12 17:51 US/Pacific. We thank you for your patience while we worked on resolving the issue.
01:03 AM
Summary: GPU device plugin component version 0.1.11-gke.1 causing missing GPU features in 1.23 and 1.24
Description: Mitigation work is currently underway by our engineering team. We have identified a mitigation and started implementing it. The mitigation is expected to complete by Monday, 2023-07-17. We will continue to provide updates on any status changes. A permanent fix is expected to be released by 2023-07-19. We will provide more information by Monday, 2023-07-17 13:00 US/Pacific.
Diagnosis: Impacted customers may not be able to use the GPU time-sharing feature.
Workaround: Upgrade the cluster to version 1.25 or 1.26.
11:31 PM
Summary: GPU device plugin component version 0.1.11-gke.1 causing missing GPU features in 1.23 and 1.24
Description: We are experiencing an issue with Google Kubernetes Engine. The following GPU features are missing on clusters on 1.23 and 1.24 using component version 0.1.11-gke.1: Prometheus metrics library version upgrade; health checker for multi-instance GPU; nvidia-modeset device configuration; GPU time-sharing. Our engineering team continues to investigate the issue. We will provide an update by Tuesday, 2023-07-11 17:35 US/Pacific with current details. We apologize to all who are affected by the disruption.
Diagnosis: None at this time.
Workaround: None at this time.
11:05 PM
Summary: GPU device plugin component version 0.1.11-gke.1 causing missing GPU features in 1.23 and 1.24
Description: We are experiencing an issue with Google Kubernetes Engine. The following GPU features are missing on clusters on 1.23 and 1.24 using component version 0.1.11-gke.1: Prometheus metrics library version upgrade; health checker for multi-instance GPU; nvidia-modeset device configuration; GPU time-sharing. Our engineering team continues to investigate the issue. We will provide an update by Tuesday, 2023-07-11 15:35 US/Pacific with current details. We apologize to all who are affected by the disruption.
Diagnosis: None at this time.
Workaround: None at this time.