[RFC]: Add control panel support for vLLM #4873
Comments
Can you explain why this type of control plane should be part of vLLM, as opposed to something that Kubernetes should handle? Most of the functionality you list here consists of core features of K8s/Knative/KServe, and it seems it would be quite difficult for us to build a production-grade system.
+1 on the comment above. We should be very careful about extending the scope of the project, because it often results in a suboptimal solution. I am curious about the main reason behind "However, the challenge for Fastchat lies in managing multiple backends, including vLLM." Would solving just that problem be sufficient to make vLLM compatible with Fastchat?
@robertgshaw2-neuralmagic @rkooo567 From the past few months of experience using Fastchat, its most attractive feature for daily management is "Centralized Access": a single URL, with all models managed in one big pool. Autoscaling and load balancing come naturally with this central control plane. I am not sure whether K8s/Knative/KServe can accomplish this... From what I know, they can only serve a single model? Certainly KServe handles autoscaling/load balancing well. The main reason I brought up this topic is that I have seen many issues and PRs posted in the Fastchat community about supporting the latest vLLM features, with no response afterwards. That is why I want to ask the vLLM community for help in addressing this "Centralized Access" requirement in our daily usage...
@leiwen83 I agree with the other comments that this seems out of scope.

From an autoscaling point of view, the important thing for vLLM itself is to expose appropriate metrics that existing cluster management systems (such as Kubernetes) can monitor and use to make and execute scaling decisions. The best metric probably isn't queue length, since the nature of requests in the queue(s) could be very different, and when queues are empty you would still want to balance based on the current state of requests being processed in each server.

In terms of centralized access to multiple models, this would just be an independent routing layer that can forward to vLLM deployments as appropriate. Of course the API that is exposed/forwarded should match the vLLM (OpenAI) API or forward payloads unchanged. If there are API parameters not propagated by the routing layer then it should be updated accordingly.

I do think that specialized load balancing is important though, to take into account the same queue-like metrics mentioned above, and to be session and/or prefix aware to maximize prefix cache hit-rates (so that consecutive requests which have the same chat history will hit the same server). @simon-mo mentioned that he was looking into this. I'm not sure that Python would necessarily be the best choice for such a router though.
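To make the prefix-aware idea above concrete, here is a minimal sketch of routing by chat-history prefix, so that consecutive requests in the same conversation hit the same server. The backend names, the number of turns hashed, and the hashing scheme are all illustrative assumptions, not anything from the thread.

```python
# Hypothetical sketch: prefix-aware request routing. Requests sharing the
# same chat-history prefix are sent to the same backend, maximizing
# prefix-cache hit rates. Backend addresses are placeholders.
import hashlib

BACKENDS = ["vllm-0:8000", "vllm-1:8000", "vllm-2:8000"]

def pick_backend(messages, prefix_turns=2):
    """Hash the first few chat turns so follow-up requests in the same
    conversation deterministically land on the same vLLM server."""
    prefix = "".join(m["content"] for m in messages[:prefix_turns])
    digest = hashlib.sha256(prefix.encode()).digest()
    return BACKENDS[int.from_bytes(digest[:8], "big") % len(BACKENDS)]
```

A real router would also need to rebalance when backends join or leave (e.g. via consistent hashing) and fall back to load-based selection when no prefix matches.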
Motivation.
The Fastchat-vLLM operational model offers significant advantages when deploying large language models (LLMs) for production services. [1]
The controller architecture in Fastchat is particularly beneficial for LLM deployment, owing to its loosely coupled design with the vLLM backend. This allows for:
Autoscaling: The vLLM backend can join and exit the cluster freely, enabling dynamic scaling capabilities.
Rolling Updates: The introduction of new models with distinct names allows the cluster to gradually update models, a process known as rolling updates.
Centralized Access: Users are relieved of the burden of tracking different URLs or IPs for various models; they simply send requests to the controller, which dispatches each request to the appropriate backend based on the model name and ensures effective load balancing.
However, the challenge for Fastchat lies in managing multiple backends, including vLLM. This complexity appears to hinder its ability to keep pace with the rapid evolution of vLLM. It is disheartening to observe that Fastchat currently does not support the latest vLLM features, such as multi-LoRA, fragmented chat-stream support, and guided decoding, among others.
Reference:
[1] https://blog.vllm.ai/2023/06/20/vllm.html
Proposed Change.
Just as a heads-up: I have ported the key controller feature from Fastchat and reduced it to a minimal shape. For interfaces like /v1/../completions, it simply extracts the model name and forwards everything else to the backend, so that all vLLM features remain usable.
Current implementation: #4861
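The dispatch logic described above (extract the model name, forward the payload unchanged) can be sketched roughly as follows. This is not the code from #4861; the class and method names are hypothetical, and backend registration/round-robin selection are assumptions about how such a minimal controller might work.

```python
# Hypothetical sketch of a minimal controller: backends register under a
# model name, and each /v1/*/completions request is routed to a backend
# serving the requested model; the payload itself would be proxied unchanged.
import itertools

class Controller:
    def __init__(self):
        self._backends = {}   # model name -> list of backend URLs
        self._cursors = {}    # model name -> round-robin iterator

    def register(self, model, url):
        """A vLLM backend announces that it serves `model` at `url`."""
        self._backends.setdefault(model, []).append(url)
        self._cursors[model] = itertools.cycle(self._backends[model])

    def route(self, request_json):
        """Extract the model name and pick a backend round-robin; the
        request body would then be forwarded to the chosen URL as-is."""
        model = request_json["model"]
        if model not in self._backends:
            raise KeyError(f"no backend serves {model!r}")
        return next(self._cursors[model])
```

Because the controller only inspects the model field and forwards the rest verbatim, new vLLM request parameters work without any controller changes, which is the main point of keeping it minimal.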
Future directions:
Feedback Period.
No response
CC List.
@simon-mo @robertgshaw2-neuralmagic
Any Other Things.
No response