DRA: fix scheduler/resource claim controller race #124931

Open
pohly wants to merge 2 commits into master from the dra-scheduler-prebind-fix branch

Conversation

@pohly (Contributor) commented May 17, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

There was a race caused by having to update claim finalizer and status in two different operations:

  • Resource claim controller removes the allocation, but does not yet get to remove the finalizer.
  • Scheduler prepares a new allocation without adding the finalizer, because the finalizer is still present.
  • Controller removes the finalizer.
  • Scheduler adds the allocation.

This is an invalid state. Automatic checking found this during the execution of the "with translated parameters on single node.*supports sharing a claim sequentially" E2E test, but only when run stand-alone. When running in parallel (as in the CI), the bad outcome of the race did not occur.

Special notes for your reviewer:

The fix is to check that the finalizer is still set when adding the allocation. The apiserver doesn't enforce this because it doesn't know which finalizer goes with the allocation result. It could check for "some finalizer", but that is not guaranteed to be correct (it could be an unrelated one).

Checking the finalizer can only be done with a JSON patch. Despite the complications, having the ability to add multiple pods concurrently to ReservedFor seems worth it (avoids expensive rescheduling or a local retry loop).
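
For illustration, a guarded patch along these lines could look as follows. This is a sketch, not the code in this PR: the helper name, the handling of the finalizer's index, and the assumption that the allocation result is already serialized to JSON are all illustrative, and the field paths follow the resource.k8s.io/v1alpha2 ResourceClaim API.

```go
package dra

import (
	"fmt"

	resourcev1alpha2 "k8s.io/api/resource/v1alpha2"
)

// allocationPatch builds a JSON patch that only succeeds if the given
// finalizer is still present at the index where the scheduler last saw it.
// If the resource claim controller removed it in the meantime, the "test"
// op fails and the apiserver rejects the whole patch.
func allocationPatch(claim *resourcev1alpha2.ResourceClaim, finalizer, allocationJSON string) ([]byte, error) {
	idx := -1
	for i, f := range claim.Finalizers {
		if f == finalizer {
			idx = i
			break
		}
	}
	if idx < 0 {
		return nil, fmt.Errorf("claim %s already lost finalizer %s", claim.Name, finalizer)
	}
	patch := fmt.Sprintf(`[
  {"op": "test", "path": "/metadata/finalizers/%d", "value": %q},
  {"op": "add", "path": "/status/allocation", "value": %s}
]`, idx, finalizer, allocationJSON)
	return []byte(patch), nil
}
```

Such a patch would be sent with a JSON-patch content type (types.JSONPatchType) against the claim's status. If the controller removed the finalizer between the scheduler's read and this write, the test op fails and the whole request is rejected instead of silently producing an allocated claim without a finalizer.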

The resource claim controller doesn't need this; it can do a normal update, which implicitly checks the ResourceVersion.

Does this PR introduce a user-facing change?

DRA: using structured parameters with a claim that gets reused between pods could lead to a claim in an invalid state (allocated without a finalizer), which then caused scheduling of pods using the claim to stop.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 17, 2024
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 17, 2024
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pohly
Once this PR has been reviewed and has the lgtm label, please assign sanposhiho for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bart0sh bart0sh added this to Triage in SIG Node PR Triage May 20, 2024
// JSON patch can only append to a non-empty array. An empty reservedFor gets
// omitted and even if it didn't, it would be null and not an empty array.
// Therefore we have to test and add if it's currently empty.
reservedForEntry := fmt.Sprintf(`{"resource": "pods", "name": %q, "uid": %q}`, pod.Name, pod.UID)
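
For readers skimming the thread: the comment above boils down to deciding in Go whether reservedFor already exists and then either creating the array or appending to it. A hypothetical sketch (not the PR's actual code):

```go
package dra

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	resourcev1alpha2 "k8s.io/api/resource/v1alpha2"
)

// reservedForPatch picks between creating reservedFor and appending to it.
func reservedForPatch(claim *resourcev1alpha2.ResourceClaim, pod *v1.Pod) []byte {
	entry := fmt.Sprintf(`{"resource": "pods", "name": %q, "uid": %q}`, pod.Name, pod.UID)
	if len(claim.Status.ReservedFor) == 0 {
		// The field is omitted (or null) while empty, so "add" the whole array.
		return []byte(fmt.Sprintf(`[{"op": "add", "path": "/status/reservedFor", "value": [%s]}]`, entry))
	}
	// The array already exists, so the "-" path appends a new element.
	return []byte(fmt.Sprintf(`[{"op": "add", "path": "/status/reservedFor/-", "value": %s}]`, entry))
}
```
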
@pohly (Contributor, Author) commented:

While it was fun playing around with JSON patch, I think this is taking it too far...

A simpler, more obvious approach would be to add a retry loop which uses normal Update calls and gets the latest claim on a conflict.

@pohly (Contributor, Author) commented:

Implemented, ready for review again.

@pohly pohly changed the title DRA: fix scheduler/resource claim controller race WIP: DRA: fix scheduler/resource claim controller race May 21, 2024
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 21, 2024
@pohly pohly changed the title WIP: DRA: fix scheduler/resource claim controller race DRA: fix scheduler/resource claim controller race May 27, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 27, 2024
The JSON patch approach works, but it is complex. A retry loop is easier to
understand (detect the conflict, get the new claim, try again). It costs one
additional API call (the get) when a conflict occurs, but in practice that
scenario is unlikely.
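
As a sketch of that shape, assuming client-go's retry helper and the resource.k8s.io/v1alpha2 typed client (errClaimLost and the exact control flow are illustrative, not the PR's code): the first attempt reuses the claim the scheduler already has, and a fresh get only happens after a conflict.

```go
package dra

import (
	"context"
	"errors"

	resourcev1alpha2 "k8s.io/api/resource/v1alpha2"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// errClaimLost is a hypothetical sentinel: the caller would fall back to
// rescheduling instead of allocating the claim.
var errClaimLost = errors.New("resource claim lost its finalizer")

// allocateWithRetry writes allocation and reservedFor with a plain
// UpdateStatus call, which the apiserver rejects when the stored
// ResourceVersion has changed. On a conflict the loop fetches the current
// claim, re-checks the finalizer, and tries again.
func allocateWithRetry(ctx context.Context, client kubernetes.Interface,
	claim *resourcev1alpha2.ResourceClaim, finalizer string,
	allocation *resourcev1alpha2.AllocationResult,
	podRef resourcev1alpha2.ResourceClaimConsumerReference) error {
	latest := claim.DeepCopy() // don't mutate the scheduler's cached copy
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		if latest == nil {
			// The previous attempt hit a conflict: get the current claim.
			var err error
			latest, err = client.ResourceV1alpha2().ResourceClaims(claim.Namespace).Get(ctx, claim.Name, metav1.GetOptions{})
			if err != nil {
				return err
			}
		}
		hasFinalizer := false
		for _, f := range latest.Finalizers {
			if f == finalizer {
				hasFinalizer = true
				break
			}
		}
		if !hasFinalizer {
			// The resource claim controller removed the finalizer in the
			// meantime; allocating now would recreate the invalid state.
			return errClaimLost
		}
		latest.Status.Allocation = allocation
		latest.Status.ReservedFor = append(latest.Status.ReservedFor, podRef)
		_, err := client.ResourceV1alpha2().ResourceClaims(latest.Namespace).UpdateStatus(ctx, latest, metav1.UpdateOptions{})
		if apierrors.IsConflict(err) {
			latest = nil // force a re-read on the next attempt
		}
		return err
	})
}
```
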
@pohly force-pushed the dra-scheduler-prebind-fix branch from 843dca1 to 434e786 on May 28, 2024 07:49
@pohly (Contributor, Author) commented May 28, 2024

/retest

@k8s-ci-robot (Contributor) commented May 28, 2024

@pohly: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kubernetes-linter-hints | 434e786 | link | false | /test pull-kubernetes-linter-hints |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@bart0sh (Contributor) commented May 28, 2024

/triage accepted
/priority important-soon

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 28, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label May 28, 2024
Labels
cncf-cla: yes, kind/bug, priority/important-soon, release-note, sig/node, sig/scheduling, size/L, triage/accepted
Projects
Status: Needs Reviewer
Development
Successfully merging this pull request may close these issues: none yet.

3 participants