DRA: fix scheduler/resource claim controller race #124931
base: master
Conversation
There was a race caused by having to update claim finalizer and status in two different operations:

- Resource claim controller removes allocation, does not yet get to remove the finalizer.
- Scheduler prepares an allocation, without adding the finalizer because it's there.
- Controller removes finalizer.
- Scheduler adds allocation.

This is an invalid state. Automatic checking found this during the execution of the "with translated parameters on single node.*supports sharing a claim sequentially" E2E test, but only when run stand-alone. When running in parallel (as in the CI), the bad outcome of the race did not occur.

The fix is to check that the finalizer is still set when adding the allocation. The apiserver doesn't check that because it doesn't know which finalizer goes with the allocation result. It could check for "some finalizer", but that is not guaranteed to be correct (it could be some unrelated one).

Checking the finalizer can only be done with a JSON patch. Despite the complications, having the ability to add multiple pods concurrently to ReservedFor seems worth it (it avoids expensive rescheduling or a local retry loop). The resource claim controller doesn't need this; it can do a normal update, which implicitly checks ResourceVersion.
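For illustration only, a finalizer-checking JSON patch could look like the following minimal sketch. This is not the code from this PR: the function, the clientset wiring, and the assumption that the finalizer sits at index 0 of metadata.finalizers are all hypothetical.

package dra

import (
	"context"
	"encoding/json"
	"fmt"

	resourcev1alpha2 "k8s.io/api/resource/v1alpha2"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// allocateWithFinalizerCheck is a hypothetical sketch, not this PR's code.
// It stores the allocation in the claim status, but only while the given
// finalizer is still present: the JSON patch "test" op makes the apiserver
// reject the whole patch if the tested value no longer matches.
// Testing index 0 assumes the finalizer's position in metadata.finalizers,
// which hints at why the review below considered this approach fragile.
func allocateWithFinalizerCheck(ctx context.Context, cs kubernetes.Interface,
	namespace, claimName, finalizer string,
	allocation *resourcev1alpha2.AllocationResult) error {
	allocationJSON, err := json.Marshal(allocation)
	if err != nil {
		return err
	}
	ops := fmt.Sprintf(`[
		{"op": "test", "path": "/metadata/finalizers/0", "value": %q},
		{"op": "add", "path": "/status/allocation", "value": %s}
	]`, finalizer, allocationJSON)
	_, err = cs.ResourceV1alpha2().ResourceClaims(namespace).Patch(
		ctx, claimName, types.JSONPatchType, []byte(ops),
		metav1.PatchOptions{}, "status")
	return err
}

A failed "test" op comes back as an error rather than a normal conflict, so the caller would have to treat it as "the finalizer is gone, do not allocate".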
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: pohly. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
// JSON patch can only append to a non-empty array. An empty reservedFor gets
// omitted and even if it didn't, it would be null and not an empty array.
// Therefore we have to test and add if it's currently empty.
reservedForEntry := fmt.Sprintf(`{"resource": "pods", "name": %q, "uid": %q}`, pod.Name, pod.UID)
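The comment above could translate into a branch like this sketch; claim and the surrounding patch handling are assumed context, not code quoted from the PR.

// Sketch: pick the JSON patch op depending on whether reservedFor already
// has entries. "/status/reservedFor/-" appends to an existing, non-empty
// array; an omitted or null field has to be created wholesale instead.
var patch string
if len(claim.Status.ReservedFor) == 0 {
	patch = fmt.Sprintf(`[{"op": "add", "path": "/status/reservedFor", "value": [%s]}]`, reservedForEntry)
} else {
	patch = fmt.Sprintf(`[{"op": "add", "path": "/status/reservedFor/-", "value": %s}]`, reservedForEntry)
}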
While it was fun playing around with JSON patch, I think this is taking it too far...
A simpler, more obvious approach would be to add a retry loop which uses normal Update calls and gets the latest claim on a conflict.
Implemented, ready for review again.
The JSON patch approach works, but it is complex. A retry loop is easier to understand (detect conflict, get new claim, try again). There is one additional API call (the get), but in practice this scenario is unlikely.
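For illustration, the retry-loop shape suggested above might look like this sketch. The clientset wiring and the hasFinalizer helper are hypothetical, not code from this PR; retry.RetryOnConflict comes from k8s.io/client-go/util/retry.

// Sketch: re-fetch the claim and retry the status update on conflicts.
// RetryOnConflict re-runs the closure whenever the update fails with a
// ResourceVersion conflict.
err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
	claim, err := cs.ResourceV1alpha2().ResourceClaims(namespace).Get(ctx, claimName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if !hasFinalizer(claim, finalizer) {
		// The controller already gave up on this claim; allocating now
		// would recreate exactly the invalid state described above.
		return fmt.Errorf("claim %s/%s lost its allocation finalizer", namespace, claimName)
	}
	claim.Status.Allocation = allocation
	_, err = cs.ResourceV1alpha2().ResourceClaims(namespace).UpdateStatus(ctx, claim, metav1.UpdateOptions{})
	return err
})

As written, this pays the extra Get on every attempt; a variant could start from a cached claim and only re-fetch after a conflict.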
Force-pushed from 843dca1 to 434e786 (Compare).
/retest
@pohly: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/triage accepted
What type of PR is this?
/kind bug
What this PR does / why we need it:
There was a race caused by having to update claim finalizer and status in two different operations:

- Resource claim controller removes allocation, does not yet get to remove the finalizer.
- Scheduler prepares an allocation, without adding the finalizer because it's there.
- Controller removes finalizer.
- Scheduler adds allocation.

This is an invalid state. Automatic checking found this during the execution of the "with translated parameters on single node.*supports sharing a claim sequentially" E2E test, but only when run stand-alone. When running in parallel (as in the CI), the bad outcome of the race did not occur.
Special notes for your reviewer:
The fix is to check that the finalizer is still set when adding the allocation. The apiserver doesn't check that because it doesn't know which finalizer goes with the allocation result. It could check for "some finalizer", but that is not guaranteed to be correct (could be some unrelated one).
Checking the finalizer can only be done with a JSON patch. Despite the complications, having the ability to add multiple pods concurrently to ReservedFor seems worth it (avoids expensive rescheduling or a local retry loop).
The resource claim controller doesn't need this, it can do a normal update which implicitly checks ResourceVersion.
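As a sketch of that contrast (assumed names, not the controller's actual code): a plain status update sends along the claim's ResourceVersion, so the apiserver rejects it with a conflict if anything modified the claim in between, and no JSON patch is needed.

// Sketch: deallocation via a normal update. claim was read earlier (for
// example from an informer cache); its ResourceVersion rides along, turning
// any concurrent modification into a conflict error instead of a lost update.
claim = claim.DeepCopy()
claim.Status.Allocation = nil
if _, err := cs.ResourceV1alpha2().ResourceClaims(claim.Namespace).UpdateStatus(ctx, claim, metav1.UpdateOptions{}); err != nil {
	return err // on a conflict, the controller simply re-syncs and tries again
}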
Does this PR introduce a user-facing change?