
Kubelet crashes on concurrent map iteration and map write #124839

Open
kangzhiqin opened this issue May 13, 2024 · 4 comments · May be fixed by #124845
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/storage Categorizes an issue or PR as relevant to SIG Storage. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@kangzhiqin

What happened?

When running the e2e tests, kubelet can hit the following fatal error:

  I0513 08:25:43.792330  285839 reflector.go:289] Starting reflector *v1.ConfigMap (0s) from object-"flexvolume-4459"/"kube-root-ca.crt"
  I0513 08:25:43.797355  285839 plugins.go:651] "Loaded volume plugin" pluginName="k8s/attachable-with-long-mount-flexvolume-4459"
  I0513 08:25:43.797400  285839 plugins.go:651] "Loaded volume plugin" pluginName="k8s/dummy-attachable"
  I0513 08:25:43.851107  285839 reconciler.go:352] "attacherDetacher.AttachVolume started" volumeName="k8s/attachable-with-long-mount-flexvolume-4459/test-long-detach-flex" nodeName="master" scheduledPods=["flexvolume-4459/flexvolume-detach-test-client"]
  I0513 08:25:43.854533  285839 operation_generator.go:400] AttachVolume.Attach succeeded for volume "test-long-detach-flex" (UniqueName: "k8s/attachable-with-long-mount-flexvolume-4459/test-long-detach-flex") from node "master"
  I0513 08:25:43.854661  285839 event.go:307] "Event occurred" object="flexvolume-4459/flexvolume-detach-test-client" fieldPath="" kind="Pod" apiVersion="v1" type="Normal" reason="SuccessfulAttachVolume" message="AttachVolume.Attach succeeded for volume \"test-long-detach-flex\" "
  I0513 08:25:43.864876  285839 plugins.go:651] "Loaded volume plugin" pluginName="k8s/attachable-with-long-mount-flexvolume-4459"
  I0513 08:25:43.866879  285839 desired_state_of_world.go:351] "volume add to dsw" volume="test-long-detach-flex" podName="ee90c357-99a0-467c-b829-1525c220da21"
  I0513 08:25:43.866933  285839 desired_state_of_world.go:351] "volume add to dsw" volume="kube-api-access-4525n" podName="ee90c357-99a0-467c-b829-1525c220da21"
  I0513 08:25:43.971490  285839 plugins.go:651] "Loaded volume plugin" pluginName="k8s/dummy-attachable"
  fatal error: concurrent map iteration and map write

  goroutine 4078 [running]:
  k8s.io/kubernetes/pkg/volume.(*VolumePluginMgr).FindPluginBySpec(0xc00b0bc008, 0xc00b944168)
          pkg/volume/plugins.go:676 +0x37f
  k8s.io/kubernetes/pkg/volume/util/operationexecutor.(*operationGenerator).GenerateMountVolumeFunc(0xc00b08ef50, 0x8bb2c97000, {{0xc0150614a0, 0x4c}, {0xc0135d7a70, 0x24}, 0xc00b944168, {0xc00c9bfac0, 0xf}, 0xc015846488, ...}, ...)

The related code is as follows. FindPluginBySpec takes a read lock (line 634):

func (pm *VolumePluginMgr) FindPluginBySpec(spec *Spec) (VolumePlugin, error) {
    pm.mutex.RLock()
    defer pm.mutex.RUnlock()

    if spec == nil {
        return nil, fmt.Errorf("could not find plugin because volume spec is nil")
    }

    var match VolumePlugin
    matchedPluginNames := []string{}
    for _, v := range pm.plugins {
        if v.CanSupport(spec) {
            match = v
            matchedPluginNames = append(matchedPluginNames, v.GetPluginName())
        }
    }

    pm.refreshProbedPlugins()
    for _, plugin := range pm.probedPlugins {
        if plugin.CanSupport(spec) {
            match = plugin
            matchedPluginNames = append(matchedPluginNames, plugin.GetPluginName())
        }
    }

    if len(matchedPluginNames) == 0 {
        return nil, fmt.Errorf("no volume plugin matched")
    }
    if len(matchedPluginNames) > 1 {
        return nil, fmt.Errorf("multiple volume plugins matched: %s", strings.Join(matchedPluginNames, ","))
    }
    return match, nil
}

refreshProbedPlugins deletes from the map (line 714):
func (pm *VolumePluginMgr) refreshProbedPlugins() {
    events, err := pm.prober.Probe()
    if err != nil {
        klog.ErrorS(err, "Error dynamically probing plugins")
    }

    // because the probe function can return a list of valid plugins
    // even when an error is present we still must add the plugins
    // or they will be skipped because each event only fires once
    for _, event := range events {
        if event.Op == ProbeAddOrUpdate {
            if err := pm.initProbedPlugin(event.Plugin); err != nil {
                klog.ErrorS(err, "Error initializing dynamically probed plugin",
                    "pluginName", event.Plugin.GetPluginName())
                continue
            }
            pm.probedPlugins[event.Plugin.GetPluginName()] = event.Plugin
        } else if event.Op == ProbeRemove {
            // Plugin is not available on ProbeRemove event, only PluginName
            delete(pm.probedPlugins, event.PluginName)
        } else {
            klog.ErrorS(nil, "Unknown Operation on PluginName.",
                "pluginName", event.Plugin.GetPluginName())
        }
    }
}

What did you expect to happen?

What's a good way for the community to solve this problem?

How can we reproduce it (as minimally and precisely as possible)?

None at present

Anything else we need to know?

No response

Kubernetes version

1.28.1

Cloud provider

false

OS version

No response

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@kangzhiqin kangzhiqin added the kind/bug Categorizes issue or PR as related to a bug. label May 13, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 13, 2024
@saschagrunert (Member)

I think the easiest way is to leave the code as-is and guard with an exclusive lock: #124845

@saschagrunert (Member)

/sig storage

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 13, 2024
@saschagrunert (Member)

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 13, 2024
@kangzhiqin (Author)

> I think the easiest way is to leave the code as-is and guard with an exclusive lock: #124845

Thank you for your reply. I'll think about this solution.
