[RFC]: Add control panel support for vLLM #4873

Open
7 of 11 tasks
leiwen83 opened this issue May 17, 2024 · 4 comments

@leiwen83
Contributor

leiwen83 commented May 17, 2024

Motivation.

The FastChat-vLLM operational model offers significant advantages when deploying large language models (LLMs) for production services [1].

The controller architecture in Fastchat is particularly beneficial for LLM deployment, owing to its loosely coupled design with the vLLM backend. This allows for:

  • Autoscaling: The vLLM backend can join and exit the cluster freely, enabling dynamic scaling capabilities.

  • Rolling Updates: The introduction of new models with distinct names allows the cluster to gradually update models, a process known as rolling updates.

  • Centralized Access: Users are relieved from the burden of tagging different URLs or IPs for various models; they simply send their requests to the controller, which then manages the rest, including dispatching requests to the appropriate backend based on the model name and ensuring effective load balancing.

However, the challenge for FastChat lies in managing multiple backends, including vLLM. This complexity appears to hinder its ability to keep pace with vLLM's rapid evolution. It is disheartening to observe that FastChat currently does not support the latest vLLM features, such as multi-LoRA, fragmented chat stream support, and guided decoding, among others.

Reference:
[1] https://blog.vllm.ai/2023/06/20/vllm.html

Proposed Change.

Just to give a heads-up: I have ported the key controller feature from FastChat in a minimal shape. For interfaces like /v1/../completions, it simply extracts the model name and forwards everything else to the backend, so that all vLLM features can be used (a minimal sketch of this flow follows the feature list below).

Current implementation: #4861

  • /v1/completions: same interface as vLLM's
  • /v1/chat/completions: same interface as vLLM's
  • /list_models: list the names of the models registered with the controller
  • /health: check the controller's health status
  • /list_workers: list each worker's detailed status, the models it provides, and its serving status
  • load balancing with a shortest-queue algorithm
  • heartbeat keep-alive between controller and workers
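
For illustration, here is a minimal sketch of the forwarding and shortest-queue logic described above. It is not the code from #4861; the FastAPI/httpx usage and all names (`Worker`, `pick_worker`, `/register_worker`, `/heartbeat`) are assumptions made for this sketch, and streaming responses are deliberately left out.

```python
# Minimal sketch of a forwarding controller (hypothetical; not the code in #4861).
import time
from dataclasses import dataclass, field

import httpx
from fastapi import FastAPI, HTTPException, Request
from fastapi.responses import JSONResponse

app = FastAPI()


@dataclass
class Worker:
    url: str                      # base URL of a vLLM OpenAI-compatible server
    models: list[str]             # model names this worker serves
    queue_len: int = 0            # in-flight requests, used for shortest-queue balancing
    last_heartbeat: float = field(default_factory=time.time)


WORKERS: dict[str, Worker] = {}   # keyed by worker URL
HEARTBEAT_TIMEOUT = 30.0          # seconds before a silent worker is considered dead


def pick_worker(model: str) -> Worker:
    """Shortest-queue selection among live workers that serve `model`."""
    now = time.time()
    candidates = [
        w for w in WORKERS.values()
        if model in w.models and now - w.last_heartbeat < HEARTBEAT_TIMEOUT
    ]
    if not candidates:
        raise HTTPException(status_code=404, detail=f"no live worker for model {model!r}")
    return min(candidates, key=lambda w: w.queue_len)


@app.post("/register_worker")
async def register_worker(request: Request):
    body = await request.json()
    WORKERS[body["url"]] = Worker(url=body["url"], models=body["models"])
    return {"status": "ok"}


@app.post("/heartbeat")
async def heartbeat(request: Request):
    body = await request.json()
    worker = WORKERS.get(body["url"])
    if worker is not None:
        worker.last_heartbeat = time.time()
        worker.queue_len = body.get("queue_len", worker.queue_len)
    return {"status": "ok"}


@app.get("/list_models")
async def list_models():
    return {"models": sorted({m for w in WORKERS.values() for m in w.models})}


@app.post("/v1/chat/completions")
@app.post("/v1/completions")
async def forward(request: Request):
    # Extract only the model name; forward the payload untouched so every
    # vLLM feature (multi-LoRA, guided decoding, ...) keeps working.
    payload = await request.json()
    worker = pick_worker(payload["model"])
    worker.queue_len += 1
    try:
        async with httpx.AsyncClient(timeout=None) as client:
            resp = await client.post(worker.url + request.url.path, json=payload)
        return JSONResponse(status_code=resp.status_code, content=resp.json())
    finally:
        worker.queue_len -= 1
```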

Future directions:

  • the controller could perhaps be reimplemented in Rust, if that turns out to improve performance significantly
  • more load-balancing algorithms
  • unified metrics exposed by the controller, collected from each worker
  • support for more interfaces, such as embeddings

Feedback Period.

No response

CC List.

@simon-mo @robertgshaw2-neuralmagic

Any Other Things.

No response

@robertgshaw2-neuralmagic
Collaborator

robertgshaw2-neuralmagic commented May 17, 2024

@leiwen83

Can you explain why this type of control plane should be part of vLLM as opposed to something that Kubernetes should handle?

Most of the functionality you list here consists of core features of K8s/Knative/KServe, and it seems like it would be quite difficult for us to build a production-grade system.

@rkooo567
Collaborator

+1 on the comment above. We should be very careful extending the scope of the project because many times it just becomes a suboptimal solution.

I am curious what the main reason is behind "However, the challenge for Fastchat lies in managing multiple backends, including vLLM." Would solving just that problem be sufficient to make vLLM compatible with FastChat?

@leiwen83
Contributor Author

leiwen83 commented May 18, 2024

@robertgshaw2-neuralmagic @rkooo567
I fully understand your concern that expanding the scope may hurt vLLM's production quality.

From the past few months of experience using FastChat, the most attractive feature in daily management is its "Centralized Access", which means a single URL and all models can be managed in one big pool. Autoscaling and load balancing come naturally with this central control panel.

I am not sure whether K8s/Knative/KServe could accomplish this... To my knowledge, they can only serve a single model? Certainly KServe handles autoscaling and load balancing well.

The main reason I brought up this topic is that I have seen many issues and PRs posted in the FastChat community about supporting the latest vLLM features, with no response afterwards. That is why I want to ask the vLLM community for help in addressing the "Centralized Access" requirement in our daily usage...

@njhill
Collaborator

njhill commented May 21, 2024

@leiwen83 I agree with the other comments that this itself seems out of scope. From an autoscaling point of view, the important thing for vLLM itself is to expose appropriate metrics that existing cluster management systems (such as Kubernetes) can monitor and use to make and execute the scaling decisions.

The best metric probably isn't queue length, since the nature of requests in the queue(s) could be very different, and when queues are empty you would still want to balance based on the current state of requests being processed in each server.
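
As a rough illustration of balancing on engine state rather than raw queue length, the sketch below scrapes each worker's Prometheus /metrics endpoint and combines the running and waiting request gauges into a single load score. The metric names used here are the ones vLLM's /metrics endpoint exposes at the time of writing; the 2x weighting of waiting requests and the helper names are assumptions made for this sketch.

```python
# Hypothetical sketch: rank workers by scraped engine metrics instead of raw queue length.
import httpx
from prometheus_client.parser import text_string_to_metric_families


def worker_load(base_url: str) -> float:
    """Scrape a worker's /metrics and combine running + waiting requests into one score."""
    text = httpx.get(f"{base_url}/metrics", timeout=5.0).text
    running = waiting = 0.0
    for family in text_string_to_metric_families(text):
        for sample in family.samples:
            if sample.name == "vllm:num_requests_running":
                running = sample.value
            elif sample.name == "vllm:num_requests_waiting":
                waiting = sample.value
    # Weight waiting requests more heavily: they signal an already-saturated engine.
    return running + 2.0 * waiting


def least_loaded(worker_urls: list[str]) -> str:
    """Pick the worker whose combined load score is lowest."""
    return min(worker_urls, key=worker_load)
```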

In terms of centralized access to multiple models, this would just be an independent routing layer that can forward to vLLM deployments as appropriate. Of course the API that is exposed/forwarded should match the vLLM (OpenAI) API or forward payloads unchanged. If there are API parameters not propagated by the routing layer then it should be updated accordingly.

I do think that specialized load balancing is important though, to take into account the same queue-like metrics mentioned above, and to be session and/or prefix aware to maximize prefix cache hit-rates (so that consecutive requests which have the same chat history will hit the same server). @simon-mo mentioned that he was looking into this.
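
As a sketch of what prefix/session-aware routing could look like (all function names and the overload threshold here are hypothetical, not an existing vLLM or FastChat API): hash the leading chat turns so that consecutive requests from the same conversation land on the same worker, and fall back to least-loaded selection when that worker is saturated.

```python
# Hypothetical sketch of prefix-aware routing to maximize prefix-cache hit rates.
import hashlib


def prefix_key(messages: list[dict], depth: int = 2) -> str:
    """Hash the first few chat turns; later turns of the same conversation share this key."""
    head = "".join(f"{m.get('role', '')}:{m.get('content', '')}" for m in messages[:depth])
    return hashlib.sha256(head.encode()).hexdigest()


def route(messages: list[dict], workers: list[str], loads: dict[str, int]) -> str:
    """Pin a conversation to one worker via its prefix hash unless that worker is overloaded."""
    key = prefix_key(messages)
    pinned = workers[int(key, 16) % len(workers)]
    # Fall back to the least-loaded worker if the pinned one is far busier than average.
    avg = sum(loads.values()) / max(len(loads), 1)
    if loads.get(pinned, 0) > 2 * avg + 4:
        return min(workers, key=lambda w: loads.get(w, 0))
    return pinned
```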

I'm not sure that python would necessarily be the best choice for such a router though.
