
Scheduler: Share frameworkImpl.waitingPods among profiles #122945

Closed
NoicFank opened this issue Jan 24, 2024 · 8 comments · Fixed by #122946 or #124926
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@NoicFank
Contributor

What would you like to be added?

I hope to share the waitingPods among multiple profiles, instead of creating a new waitingPods in each profile as the current implementation does.

In other words, waitingPods would be instantiated only once per scheduler app.

```go
f := &frameworkImpl{
	registry:             r,
	snapshotSharedLister: options.snapshotSharedLister,
	scorePluginWeight:    make(map[string]int),
	waitingPods:          newWaitingPodsMap(),
	clientSet:            options.clientSet,
	// ...
}
```
@Huang-Wei @ahg-g PTAL. Is there any negative impact of this change that I haven't considered? Thanks.

Why is this needed?

In some scenarios, I need to traverse all pods waiting in the permit stage within the current scheduler app, rather than only the pods that use the current profile for scheduling and are waiting in the permit stage.

For example, we want to use coscheduling to achieve all-or-nothing scheduling. We have abstracted the concept of a Cluster, where each cluster contains two deployments. Each deployment uses a different scheduler profile, and one cluster corresponds to one podGroup, as follows:
[diagram: one Cluster containing deployment A (podA1, podA2) and deployment B (podB1), each deployment using a different scheduler profile, with the whole cluster mapped to a single podGroup]

Then, we want the pods in one cluster (podA1, podA2, podB1) to succeed or fail scheduling together.

We assume that all pods can pass the filter plugins and that podA2 is the last to be scheduled. At this point, podA1 and podB1 are both waiting in the permit stage. Afterwards, we traverse waitingPods to complete the permit wait for podA1 and podB1. However, with the current implementation we can only see podA1 in waitingPods; podB1 cannot be seen, because podA2 and podB1 use different scheduling profiles. This causes podB1 to time out during the permit phase, even though all pods within the podGroup were successfully scheduled.
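
For context, here is a minimal sketch of what such a traversal looks like from a Permit plugin, using the scheduler framework's `IterateOverWaitingPods` handle method. The `podGroupLabel` key and the `allowWaitingPodsInGroup` helper are illustrative assumptions, not the coscheduling plugin's actual code:

```go
package coscheduling

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const (
	pluginName    = "Coscheduling"
	podGroupLabel = "example.io/pod-group" // hypothetical label key
)

// allowWaitingPodsInGroup walks every pod parked in the permit stage and
// signals Allow for those in the same pod group. With per-profile
// waitingPods maps, this only sees pods scheduled through the *current*
// profile (podA1 but not podB1 in the example above); with a shared map,
// it would see them all.
func allowWaitingPodsInGroup(h framework.Handle, pod *v1.Pod) {
	group := pod.Labels[podGroupLabel]
	h.IterateOverWaitingPods(func(wp framework.WaitingPod) {
		if wp.GetPod().Labels[podGroupLabel] == group {
			wp.Allow(pluginName)
		}
	})
}
```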

@NoicFank NoicFank added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 24, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 24, 2024
@NoicFank
Contributor Author

/sig scheduling

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 24, 2024
@NoicFank
Contributor Author

/assign @Huang-Wei @ahg-g

@NoicFank
Contributor Author

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 25, 2024
@kerthcet
Member

So the Cluster is a cross-k8s-cluster object, and these two deployments are under different k8s clusters, right?

@NoicFank
Contributor Author

> So the Cluster is a cross-k8s-cluster object, and these two deployments are under different k8s clusters, right?

Not really; these two deployments are located in the same k8s cluster.

You can think of the Cluster in the example above as roughly equivalent to a namespace: the two deployments live in the same namespace, and we want all pods managed by both deployments under that namespace to be scheduled successfully or to fail simultaneously.

@kerthcet
Member

kerthcet commented Jan 26, 2024

Can you elaborate more on why these deployments should be scheduled together? I can imagine this might be useful; I'm just curious about the user story. Thanks.

@NoicFank
Contributor Author

NoicFank commented Jan 26, 2024

> Can you elaborate more on why these deployments should be scheduled together? I can imagine this might be useful; I'm just curious about the user story. Thanks.

Of course. Our specific usage scenario is stateful services (databases). For a DB instance (a Cluster, as mentioned above), there are many pods under each instance. We divide those pods into different components according to their function (mainly different images), and each component is managed by its own workload (sts/deploy/...). Overall, there are multiple workloads under each DB instance, and each workload manages its own pods.

For scheduling, we need all pods under the same instance to be scheduled successfully to ensure DB service availability: if only some pods under an instance are scheduled successfully, the DB service is still unavailable (even if all pods managed by one of the instance's workloads are scheduled successfully).

@NoicFank
Contributor Author

> Can you elaborate more on why these deployments should be scheduled together? I can imagine this might be useful; I'm just curious about the user story. Thanks.

The following is an open-source project for managing databases on K8s that you may find interesting:
https://github.com/apecloud/kubeblocks
