
Pod is stuck in terminating with "MountVolume.SetUp failed for volume ... not registered" error #113289

Open
tossmilestone opened this issue Oct 24, 2022 · 12 comments · May be fixed by #124890
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@tossmilestone
Member

tossmilestone commented Oct 24, 2022

What happened?

A pod gets stuck in the Terminating state when it is created and deleted in rapid succession, and the kubelet reports the following volume setup error:

Oct 21 12:09:20 slave2 kubelet[4089]: E1021 12:09:20.197090    4089 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/secret/dc79f107-6576-4d98-a120-23336154393e-tls-cert-vol podName:dc79f107-6576-4d98-a120-23336154393e nodeName:}" failed. No retries permitted until 2022-10-21 12:11:22.19703932 +0800 CST m=+323223.503182418 (durationBeforeRetry 2m2s). Error: MountVolume.SetUp failed for volume "tls-cert-vol" (UniqueName: "kubernetes.io/secret/dc79f107-6576-4d98-a120-23336154393e-tls-cert-vol") pod "vela-core-8494697c8c-xgrwt" (UID: "dc79f107-6576-4d98-a120-23336154393e") : object "cpaas-system"/"vela-core-admission" not registered

This is the same error as #105204, but the Kubernetes 1.23.13 we use already includes the fix for that issue from PR #108756.

It appears that the PR hasn't completely fixed the issue. The cause may be that if the pod is deleted while the volume manager is setting up its volumes, the syncPod loop in the kubelet won't register the pod (add the reference count) because it is already in the terminating state.

So the volume manager keeps trying to set up the pod's volumes, but fails with the "not registered" error.
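To make the suspected race concrete, here is a tiny self-contained sketch. The types and names below are hypothetical stand-ins for the kubelet's watch-based secret/configmap manager, not the real kubelet code:

package main

import (
    "fmt"
    "sync"
)

// Hypothetical stand-in for the watch-based secret/configmap manager:
// volume setup can only resolve objects of pods that were registered.
type objectManager struct {
    mu         sync.Mutex
    registered map[string]bool
}

func (m *objectManager) RegisterPod(objectKey string) {
    m.mu.Lock()
    defer m.mu.Unlock()
    m.registered[objectKey] = true
}

func (m *objectManager) Get(objectKey string) error {
    m.mu.Lock()
    defer m.mu.Unlock()
    if !m.registered[objectKey] {
        return fmt.Errorf("object %q not registered", objectKey)
    }
    return nil
}

func main() {
    mgr := &objectManager{registered: map[string]bool{}}

    // The pod is deleted before its first sync, so the sync loop treats it
    // as terminating and skips registration (this bool stands in for that
    // branch of the suspected race).
    terminating := true
    if !terminating {
        mgr.RegisterPod("cpaas-system/vela-core-admission")
    }

    // The volume manager keeps retrying MountVolume.SetUp regardless, and
    // every attempt fails the same way the kubelet log above shows.
    if err := mgr.Get("cpaas-system/vela-core-admission"); err != nil {
        fmt.Println("MountVolume.SetUp failed:", err)
    }
}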

What did you expect to happen?

The pod should be deleted without the volume setup error.

How can we reproduce it (as minimally and precisely as possible)?

These are the possible reproduction steps:

  1. Create a pod with a configmap volume
  2. After the pod is created, delete the pod immediately

The steps may need to be run several times; a minimal client-go sketch of this loop is below.
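The sketch assumes a reachable cluster via the default kubeconfig; the namespace, names, image, iteration count, and delay are arbitrary choices for this sketch, not values from the original report:

package main

import (
    "context"
    "fmt"
    "time"

    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Assumes a reachable cluster via the default kubeconfig.
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client, err := kubernetes.NewForConfig(config)
    if err != nil {
        panic(err)
    }
    ctx := context.Background()
    ns := "default"

    // A configmap to back the pod's volume.
    cm := &v1.ConfigMap{
        ObjectMeta: metav1.ObjectMeta{Name: "repro-cm"},
        Data:       map[string]string{"key": "value"},
    }
    if _, err := client.CoreV1().ConfigMaps(ns).Create(ctx, cm, metav1.CreateOptions{}); err != nil {
        fmt.Println("configmap create:", err)
    }

    for i := 0; i < 50; i++ {
        name := fmt.Sprintf("repro-pod-%d", i)
        pod := &v1.Pod{
            ObjectMeta: metav1.ObjectMeta{Name: name},
            Spec: v1.PodSpec{
                Containers: []v1.Container{{
                    Name:         "pause",
                    Image:        "registry.k8s.io/pause:3.9",
                    VolumeMounts: []v1.VolumeMount{{Name: "cm", MountPath: "/cm"}},
                }},
                Volumes: []v1.Volume{{
                    Name: "cm",
                    VolumeSource: v1.VolumeSource{
                        ConfigMap: &v1.ConfigMapVolumeSource{
                            LocalObjectReference: v1.LocalObjectReference{Name: "repro-cm"},
                        },
                    },
                }},
            },
        }
        if _, err := client.CoreV1().Pods(ns).Create(ctx, pod, metav1.CreateOptions{}); err != nil {
            fmt.Println("pod create:", err)
            continue
        }
        // Delete immediately so the kubelet's first sync may already see the
        // pod as terminating, before the volume manager has set up the volume.
        if err := client.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{}); err != nil {
            fmt.Println("pod delete:", err)
        }
        // The timing window is narrow; the delay may need tuning.
        time.Sleep(200 * time.Millisecond)
    }
}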

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
v1.23.13

Cloud provider

None

OS version

CentOS 7.3

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@tossmilestone tossmilestone added the kind/bug Categorizes issue or PR as related to a bug. label Oct 24, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 24, 2022
@tossmilestone
Member Author

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 24, 2022
@pacoxu pacoxu added this to Triage in SIG Node Bugs Oct 24, 2022
@SergeyKanzhelev
Member

/assign @rphillips

This sounds pretty close to another bug we looked at earlier, where cleanup was failing and caused many pod GC loops to keep running. @rphillips will take a look to see if it's the same issue.

@SergeyKanzhelev
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 9, 2022
@SergeyKanzhelev SergeyKanzhelev moved this from Triage to Triaged in SIG Node Bugs Nov 9, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 7, 2023
@vaibhav2107
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 14, 2023
@ugur99

ugur99 commented Aug 28, 2023

We have the same problem from time to time with two different k8s versions, v1.24.10 and v1.23.7, usually during the rollout process; the old pods get stuck in the termination phase with the following errors. Are there any updates on this issue?

kubelet logs:

E0828 00:58:36.944965 3293071 nestedpendingoperations.go:335] Operation for "{volumeName:kubernetes.io/projected/3fdf6bd3-78d8-4ce6-a0ed-e97005a8d06b-kube-api-access-qlshz podName:3fdf6bd3-78d8-4ce6-a0ed-e97005a8d06b nodeName:}" failed. No retries permitted until 2023-08-28 01:00:38.94493703 +0000 UTC m=+218616.925334872 (durationBeforeRetry 2m2s). Error: MountVolume.SetUp failed for volume "kube-api-access-qlshz" (UniqueName: "kubernetes.io/projected/3fdf6bd3-78d8-4ce6-a0ed-e97005a8d06b-kube-api-access-qlshz") pod "pod-name-79687c7975-hrfmv" (UID: "3fdf6bd3-78d8-4ce6-a0ed-e97005a8d06b") : object "test-namespace"/"kube-root-ca.crt" not registered
I0828 01:00:39.014791 3293071 reconciler.go:258] "operationExecutor.MountVolume started for volume \"kube-api-access-qlshz\" (UniqueName: \"kubernetes.io/projected/3fdf6bd3-78d8-4ce6-a0ed-e97005a8d06b-kube-api-access-qlshz\") pod \"pod-name-79687c7975-hrfmv\" (UID: \"3fdf6bd3-78d8-4ce6-a0ed-e97005a8d06b\") " pod="test-namespace/pod-name-79687c7975-hrfmv"
E0828 01:00:39.014988 3293071 nestedpendingoperations.go:335] Operation for "{volumeName:kubernetes.io/projected/3fdf6bd3-78d8-4ce6-a0ed-e97005a8d06b-kube-api-access-qlshz podName:3fdf6bd3-78d8-4ce6-a0ed-e97005a8d06b nodeName:}" failed. No retries permitted until 2023-08-28 01:02:41.014973186 +0000 UTC m=+218738.995371018 (durationBeforeRetry 2m2s). Error: MountVolume.SetUp failed for volume "kube-api-access-qlshz" (UniqueName: "kubernetes.io/projected/3fdf6bd3-78d8-4ce6-a0ed-e97005a8d06b-kube-api-access-qlshz") pod "pod-name-79687c7975-hrfmv" (UID: "3fdf6bd3-78d8-4ce6-a0ed-e97005a8d06b") : object "test-namespace"/"kube-root-ca.crt" not registered
I0828 01:02:41.034974 3293071 reconciler.go:258] "operationExecutor.MountVolume started for volume \"kube-api-access-qlshz\" (UniqueName: \"kubernetes.io/projected/3fdf6bd3-78d8-4ce6-a0ed-e97005a8d06b-kube-api-access-qlshz\") pod \"pod-name-79687c7975-hrfmv\" (UID: \"3fdf6bd3-78d8-4ce6-a0ed-e97005a8d06b\") " pod="test-namespace/pod-name-79687c7975-hrfmv"

kubectl describe pod pod-name-79687c7975-hrfmv:

Name:                      pod-name-79687c7975-hrfmv
Namespace:                 test-namespace
Node:                      worker02/ip
Status:                    Terminating (lasts 132m)
Termination Grace Period:  30s
Containers:
  pod-name:
    State:          Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Fri, 25 Aug 2023 13:17:59 +0200
      Finished:     Fri, 25 Aug 2023 14:19:12 +0200
    Ready:          False
    Restart Count:  0
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kube-api-access-qlshz:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
Events:
  Type     Reason       Age                  From     Message
  ----     ------       ----                 ----     -------
  Warning  FailedMount  86s (x70 over 131m)  kubelet  MountVolume.SetUp failed for volume "kube-api-access-qlshz" : object "test-namespace"/"kube-root-ca.crt" not registered

@sharad740

I fixed this error:

20m (x16 over 36m)       Warning   FailedMount   Pod/frappe-bench-mariadb-0   MountVolume.SetUp failed for volume "pvc-88b8fbdf-42e4-4a2e-9290-0b823bf677fa" : mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t nfs 10.0.2.10:/home/sharad/nfs/default/erpnext-data-frappe-bench-mariadb-0-pvc-88b8fbdf-42e4-4a2e-9290-0b823bf677fa /var/lib/kubelet/pods/61cf3bbe-21cc-4b98-b698-9f9edc83900f/volumes/kubernetes.io~nfs/pvc-88b8fbdf-42e4-4a2e-9290-0b823bf677fa
Output: mount: /var/lib/kubelet/pods/61cf3bbe-21cc-4b98-b698-9f9edc83900f/volumes/kubernetes.io~nfs/pvc-88b8fbdf-42e4-4a2e-9290-0b823bf677fa: bad option; for several filesystems (e.g. nfs, cifs) you might need a /sbin/mount.<type> helper program.

by installing nfs-common on my worker node:

sudo apt install -y nfs-common

Hope you don't have such a silly mistake.

@smileusd
Contributor

I also hit this issue: secret and configmap volumes block pod termination when a namespace is deleted.

@smileusd
Contributor

/sig storage

@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Mar 18, 2024
@PaulFurtado

@sharad740 that is a different mount error from the "not registered" issue.

@SergeyKanzhelev we're hitting this issue somewhat frequently in large clusters, and it seems to be one of the few issues of this kind where restarting the kubelet does not resolve the problem and we have to force delete the pod.

E0503 11:35:58.744234 4036992 projected.go:292] Couldn't get configMap browsers-dashboard/kube-root-ca.crt: object "browsers-dashboard"/"kube-root-ca.crt" not registered
E0503 11:35:58.744244 4036992 projected.go:198] Error preparing data for projected volume kube-api-access-dssjd for pod browsers-dashboard/session-agent-bc77c896-scv2f: object "browsers-dashboard"/"kube-root-ca.crt" not registered

This is on kubernetes v1.27.13

It seems to often happen for pods in the Terminating state for us and impacts all of the secret and configmap volumes, but no others.

@olyazavr
Contributor

olyazavr commented May 3, 2024

^ For that case in particular, we see pods that have been running for a while and successfully mounting volumes start to fail mounting them while terminating. I can see the kubelet stop watching the secret the pod needs after eviction starts but before termination completes. Something peculiar is that the logs show the kubelet treating this as a new pod starting up while it is terminating, which then causes the failure because it can no longer mount the volume for the secret it stopped watching.

Eviction
I0503 06:40:32.966708 3490635 kubelet.go:2362] "SyncLoop DELETE" source="api" pods=[browsers-dashboard/session-agent-bc77c896-scv2f]
I0503 06:40:32.967683 3490635 kuberuntime_container.go:744] "Killing container with a grace period" pod="browsers-dashboard/session-agent-bc77c896-scv2f" podUID=f4597db6-4c03-4bee-9e0a-15cbced247bf containerName="app" containerID="containerd://9667033a35e9f90fe98351a266793c2b08250c4a7a09864e8b2de2c89a0c10a0" gracePeriod=86400

Final successful vol mount
I0503 06:40:12.440243 3490635 operation_generator.go:722] "MountVolume.SetUp succeeded for volume \"kube-public-cert\" (UniqueName: \"kubernetes.io/secret/f4597db6-4c03-4bee-9e0a-15cbced247bf-kube-public-cert\") pod \"session-agent-bc77c896-scv2f\" (UID: \"f4597db6-4c03-4bee-9e0a-15cbced247bf\") " pod="browsers-dashboard/session-agent-bc77c896-scv2f"

Stop watching the secret
I0503 06:45:47.492056 3490635 watch_based_manager.go:352] "Not acquired for long time, Stopped watching for changes" objectKey={namespace:browsers-dashboard name:kube-public-cert uid:} maxIdleTime="5m0s"

Pod is re-added?
I0503 06:59:10.489083 4036992 kuberuntime_manager.go:1370] "getSandboxIDByPodUID got sandbox IDs for pod" podSandboxID=[927c000a6fc6843dd7a39dc8b3428e0f7f7d3bdfef1da5b6c50abd13c4b867ca] pod="browsers-dashboard/session-agent-bc77c896-scv2f"
I0503 06:59:10.492329 4036992 generic.go:459] "PLEG: Write status" pod="browsers-dashboard/session-agent-bc77c896-scv2f"
I0503 06:59:11.468951 4036992 topology_manager.go:212] "Topology Admit Handler" podUID=f4597db6-4c03-4bee-9e0a-15cbced247bf podNamespace="browsers-dashboard" podName="session-agent-bc77c896-scv2f"
I0503 06:59:11.468931 4036992 kubelet.go:2346] "SyncLoop ADD" source="api" pods=[browsers-dashboard/session-agent-bc77c896-scv2f]
I0503 06:59:11.468859 4036992 config.go:398] "Receiving a new pod" pod="browsers-dashboard/session-agent-bc77c896-scv2f"

Disaster
I0503 06:59:14.490456 4036992 event.go:307] "Event occurred" object="browsers-dashboard/session-agent-bc77c896-scv2f" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FailedMount" message="MountVolume.SetUp failed for volume \"kube-public-cert\" : object \"browsers-dashboard\"/\"kube-public-cert\" not registered"

Several questions arise here:

  1. Why does kubelet think the pod is re-added? And can something detect that it's actually not?
  2. Why do we stop watching secrets for terminating pods?
    klog.V(4).InfoS("Not acquired for long time, Stopped watching for changes", "objectKey", key, "maxIdleTime", c.maxIdleTime)
  3. Since termination can take some time, why is registering the pod something reserved only for non-terminating pods? https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet.go#L1842-L1849

Update: it looks like the kubelet had restarted here, so that explains the ADD.
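On question 2, the behaviour behind that log line can be pictured with a small self-contained sketch of an idle-expiring watch cache; everything below is a hypothetical illustration, not the actual watch_based_manager code:

package main

import (
    "fmt"
    "sync"
    "time"
)

// An entry whose last access is older than maxIdleTime has its watch
// stopped, so later lookups no longer see fresh data.
type watchEntry struct {
    lastAccess time.Time
    watching   bool
}

type idleCache struct {
    mu          sync.Mutex
    maxIdleTime time.Duration
    entries     map[string]*watchEntry
}

func (c *idleCache) Get(key string) *watchEntry {
    c.mu.Lock()
    defer c.mu.Unlock()
    e, ok := c.entries[key]
    if !ok {
        e = &watchEntry{watching: true}
        c.entries[key] = e
    }
    e.lastAccess = time.Now()
    return e
}

// stopIdleWatches mimics the periodic sweep: any entry not accessed within
// maxIdleTime stops being watched.
func (c *idleCache) stopIdleWatches() {
    c.mu.Lock()
    defer c.mu.Unlock()
    for key, e := range c.entries {
        if e.watching && time.Since(e.lastAccess) > c.maxIdleTime {
            e.watching = false
            fmt.Printf("not acquired for long time, stopped watching %q\n", key)
        }
    }
}

func main() {
    c := &idleCache{maxIdleTime: 50 * time.Millisecond, entries: map[string]*watchEntry{}}
    c.Get("browsers-dashboard/kube-public-cert") // the pod resolves the secret once at mount time
    time.Sleep(100 * time.Millisecond)           // no further access while the pod terminates
    c.stopIdleWatches()                          // the watch is dropped, mirroring the kubelet log
}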

@olyazavr
Contributor

I got to the bottom of what was happening in our case, and it's possible it's the same thing that was happening in the original report:

In the pod worker loop, running pods go through the SyncPod function, while terminating pods go through the SyncTerminatingPod function. SyncPod has the pod registration flow, but SyncTerminatingPod does not, and that registration is what lets the kubelet know which secrets/configmaps back the mounted volumes. I added:

// Register the pod's secrets/configmaps with the watch-based managers even
// while the pod is terminating, so volume setup can still resolve them.
if kl.podWorkers.CouldHaveRunningContainers(pod.UID) {
    if kl.secretManager != nil {
        kl.secretManager.RegisterPod(pod)
    }
    if kl.configMapManager != nil {
        kl.configMapManager.RegisterPod(pod)
    }
}

in SyncTerminatingPod, and that resolved my issue: when the kubelet restarted, terminating pods were still able to mount volumes and shut down correctly without getting stuck. I can see how, if you create and delete a pod quickly, you could end up in a similar situation where the first sync goes through SyncTerminatingPod and the same volume issue arises.

@olyazavr olyazavr linked a pull request May 15, 2024 that will close this issue