Google Cloud MINOR

Global: Calico enabled GKE clusters’ pods may get stuck Terminating or Pending after upgrading to 1.22+

September 16, 2022 · 12:01 AM UTC – September 29, 2022 · 09:49 PM UTC · Duration: 333h 48min

Affected Services

Google Kubernetes Engine

Timeline

09:49 PM
The issue with Google Kubernetes Engine has been resolved for all affected users as of Thursday, 2022-09-29 13:45 US/Pacific. A fix is available in GKE v1.24.4-gke.800 and is also available in v1.23 and v1.22. Customers can manually upgrade to the fixed version. Alternatively, clusters on the RAPID, REGULAR or STABLE release channels using 1.22 or 1.23 will upgrade automatically over the coming weeks.
11:48 PM
Summary: Global: Calico enabled GKE clusters’ pods may get stuck Terminating or Pending after upgrading to 1.22+
Description: The following GKE versions are vulnerable to a race condition when using the Calico Network Policy, resulting in pods stuck Terminating or Pending:
  - All 1.22 GKE versions
  - All 1.23 GKE versions
  - 1.24 versions before 1.24.4-gke.800
Only a small number of GKE clusters have actually experienced stuck pods. Use of cluster autoscaler can increase the chance of hitting the race condition.
A fix is available in GKE v1.24.4-gke.800 or later. The fix is also being made available in v1.23 and v1.22, as part of the next release, which has now started. Once available, customers can manually upgrade to the fixed version. Alternatively, clusters on the RAPID, REGULAR or STABLE release channels using 1.22 or 1.23 will upgrade automatically over the coming weeks.
We will provide an update by Friday, 2022-09-30 15:00 US/Pacific with current details.
The issue was introduced in the Calico component, and GKE has been working closely with the Calico project to produce a fix. We apologize to all who are affected by the disruption.
Diagnosis: The Calico CNI plugin shows the following error when terminating Pods:
  Warning FailedKillPod 36m (x389 over 121m) kubelet error killing pod: failed to "KillPodSandbox" for "af9ab8f9-d6d6-4828-9b8c-a58441dd1f86" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod "myclient-pod-6474c76996" network: error getting ClusterInformation: connection is unauthorized: Unauthorized"
Workaround: Customers currently experiencing the issue are requested to take one of the following actions:
  - [Recommended] Manually upgrade to GKE v1.24.4-gke.800 or later (if viable), or reach out to Google Cloud Support to have an internal patch applied.
  - Restart the kubelet and calico-node to get the pods unstuck.
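The affected-version list in the update above can be expressed as a small check. The following is a minimal sketch, not an official tool: the function name `is_affected` and the assumed version string shape (`1.24.4-gke.800`, with an optional leading `v`) are illustrative. It encodes only what this update states, so it treats every 1.22 and 1.23 version as affected; the fixed 1.22 and 1.23 patch versions are not named in this report and are therefore not modeled. The check is only meaningful for clusters that use the Calico Network Policy.

```python
import re

# Assumed shape of a GKE version string, e.g. "1.24.4-gke.800" (optional leading "v").
_GKE_VERSION = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)-gke\.(\d+)$")

# First 1.24 version that carries the fix, per the update above: (patch, gke build).
_FIXED_1_24 = (4, 800)


def is_affected(version: str) -> bool:
    """Return True if the version falls in the affected ranges listed in the update.

    Hypothetical helper: 1.22.x and 1.23.x are all treated as affected because
    the report does not name the fixed patch versions for those minors.
    """
    match = _GKE_VERSION.match(version.strip())
    if match is None:
        raise ValueError(f"unrecognized GKE version: {version!r}")
    major, minor, patch, gke_build = (int(group) for group in match.groups())
    if major != 1:
        return False
    if minor in (22, 23):
        return True
    if minor == 24:
        # Affected only when strictly older than 1.24.4-gke.800.
        return (patch, gke_build) < _FIXED_1_24
    # Versions before 1.22 and after 1.24 are outside the listed ranges.
    return False
```

For example, `is_affected("1.24.3-gke.2100")` evaluates to true under these rules, while `is_affected("1.24.4-gke.800")` evaluates to false.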
11:06 PM
Summary: Global: Calico enabled GKE clusters’ pods may get stuck terminating after upgrading to 1.22+
Description: The following GKE versions are vulnerable to a race condition when using the Calico Network Policy, resulting in pods stuck Terminating or Pending:
  - All 1.22 GKE versions
  - All 1.23 GKE versions
  - 1.24 versions before 1.24.4-gke.800
Only a small number of GKE clusters have actually experienced stuck pods. Use of cluster autoscaler can increase the chance of hitting the race condition.
A fix is available in GKE v1.24.4-gke.800 or later. The fix is also being made available in v1.23 and v1.22, as part of the next release. Once available, customers can manually upgrade to the fixed version. Alternatively, clusters on the RAPID, REGULAR or STABLE release channels using 1.22 or 1.23 will upgrade automatically over the coming weeks.
We will provide an update by Friday, 2022-09-23 16:00 US/Pacific with current details.
The issue was introduced in the Calico component, and GKE has been working closely with the Calico project to produce a fix. We apologize to all who are affected by the disruption.
Diagnosis: The Calico CNI plugin shows the following error when terminating Pods:
  Warning FailedKillPod 36m (x389 over 121m) kubelet error killing pod: failed to "KillPodSandbox" for "af9ab8f9-d6d6-4828-9b8c-a58441dd1f86" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod "myclient-pod-6474c76996" network: error getting ClusterInformation: connection is unauthorized: Unauthorized"
Workaround: Customers currently experiencing the issue are requested to take one of the following actions:
  - [Recommended] Manually upgrade to GKE v1.24.4-gke.800 or later (if viable), or reach out to Google Cloud Support to have an internal patch applied.
  - Restart the kubelet and calico-node to get the pods unstuck.
10:56 PM
Summary: Global: Calico enabled GKE clusters’ pods may get stuck terminating after upgrading to 1.22+
Description: The following GKE versions are vulnerable to a race condition when using the Calico Network Policy, resulting in pods stuck Terminating or Pending:
  - All 1.22 GKE versions
  - All 1.23 GKE versions
  - 1.24 versions before 1.24.4-gke.800
Only a small number of GKE clusters have actually experienced stuck pods. Use of cluster autoscaler can increase the chance of hitting the race condition.
A fix is available in GKE v1.24.4-gke.800 or later. The fix is also being made available in v1.23 and v1.22, as part of the next release. Once available, customers can manually upgrade to the fixed version. Alternatively, clusters on the RAPID, REGULAR or STABLE release channels using 1.22 or 1.23 will upgrade automatically over the coming weeks.
We will provide an update by Friday, 2022-09-23 15:00 US/Pacific with current details.
The issue was introduced in the Calico component, and GKE has been working closely with the Calico project to produce a fix. We apologize to all who are affected by the disruption.
Diagnosis: The Calico CNI plugin shows the following error when terminating Pods:
  Warning FailedKillPod 36m (x389 over 121m) kubelet error killing pod: failed to "KillPodSandbox" for "af9ab8f9-d6d6-4828-9b8c-a58441dd1f86" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod "myclient-pod-6474c76996" network: error getting ClusterInformation: connection is unauthorized: Unauthorized"
Workaround: Customers currently experiencing the issue are requested to take one of the following actions:
  - [Recommended] Manually upgrade to GKE v1.24.4-gke.800 or later (if viable), or reach out to Google Cloud Support to have an internal patch applied.
  - Restart the kubelet and calico-node to get the pods unstuck.
11:22 PM
Summary: Global: Calico enabled GKE clusters’ pods may get stuck terminating after upgrading to 1.22+
Description: GKE clusters that use Calico Network Policy and run the following versions might experience issues with pods under some conditions:
  - All 1.22 GKE versions
  - All 1.23 GKE versions
  - 1.24 versions before 1.24.4-gke.800
A fix is available in GKE v1.24.4-gke.800 or later. After qualification completes, we will expedite the backport of the fix to 1.22 and 1.23. Clusters on the RAPID, REGULAR or STABLE release channels using 1.22 or 1.23 will upgrade automatically over the coming weeks, or customers can manually upgrade to the fixed version.
We will provide an update by Wednesday, 2022-09-21 15:00 US/Pacific with current details.
The issue was introduced in the Calico component, and GKE has been working closely with the Calico project to produce a fix. We apologize to all who are affected by the disruption.
Diagnosis: The Calico CNI plugin shows the following error when terminating Pods:
  Warning FailedKillPod 36m (x389 over 121m) kubelet error killing pod: failed to "KillPodSandbox" for "af9ab8f9-d6d6-4828-9b8c-a58441dd1f86" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod "myclient-pod-6474c76996" network: error getting ClusterInformation: connection is unauthorized: Unauthorized"
Workaround: Affected customers may try the following:
  - [Recommended] Customers on affected versions can reach out to Google Cloud Support to have an internal patch applied.
  - Customers can restart the kubelet and calico-node to get the pods unstuck.
12:01 AM
Summary: Global: Calico enabled GKE clusters’ pods may get stuck terminating after upgrading to 1.22+
Description: GKE clusters that run version 1.22 or later and use Calico Network Policy might experience issues with terminating Pods under some conditions. Our engineering team continues to investigate the issue and is qualifying a potential mitigation for release to the Rapid channel 1.24. After all the qualifications are done, we will expedite the backport of the fix to 1.22 as soon as possible.
We will provide an update by Friday, 2022-09-16 15:00 US/Pacific with current details. We apologize to all who are affected by the disruption.
Diagnosis: The Calico CNI plugin will show the following error when terminating Pods:
  Warning FailedKillPod 36m (x389 over 121m) kubelet error killing pod: failed to "KillPodSandbox" for "af9ab8f9-d6d6-4828-9b8c-a58441dd1f86" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod "myclient-pod-6474c76996" network: error getting ClusterInformation: connection is unauthorized: Unauthorized"
Workaround: Affected customers may try the following:
  1. Restart the kubelet and calico-node, which can help get the pods unstuck.
  2. Disable the Calico network policy.
Workaround 1 is recommended; workaround 2 is only viable if the customer does not have a strong need for Calico.
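The error shown in the Diagnosis above has a recognizable signature, so already collected event text can be scanned for it. This is a minimal sketch under stated assumptions: the helper name `find_teardown_failures` is hypothetical, and the input is assumed to be plain event lines (for instance, saved output from describing a stuck pod). It matches only the terminating-pod event quoted in this report; events for pods stuck Pending are not shown in the report and are not covered.

```python
from typing import Iterable, List

# Substrings taken from the error quoted in the Diagnosis section.
EVENT_REASON = "FailedKillPod"
ERROR_SIGNATURE = "error getting ClusterInformation: connection is unauthorized"


def find_teardown_failures(event_lines: Iterable[str]) -> List[str]:
    """Return the event lines that carry the Calico teardown error signature.

    Hypothetical helper: a match suggests, but does not prove, that a pod is
    stuck Terminating because of the race condition described in this incident.
    """
    return [
        line
        for line in event_lines
        if EVENT_REASON in line and ERROR_SIGNATURE in line
    ]
```

A non-empty result on a cluster running an affected version with Calico Network Policy enabled is a reasonable cue to apply one of the workarounds listed above.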