[RFC]: Add control panel support for vLLM #4873
Comments
Can you explain why this type of control plane should be part of vLLM, as opposed to something that Kubernetes should handle? Most of the functionality you list here consists of core features of K8s/Knative/KServe, and it seems it would be quite difficult for us to build a production-grade system.
+1 on the comment above. We should be very careful about extending the scope of the project, because it often results in a suboptimal solution. I am curious about the main reason behind "However, the challenge for Fastchat lies in managing multiple backends, including vLLM." Would solving just that problem be sufficient to make vLLM compatible with Fastchat?
@robertgshaw2-neuralmagic @rkooo567 From the past few months of experience using Fastchat, its most attractive feature for daily management is "Centralized Access": a single URL, with all models managed in one big pool. Autoscaling and load balancing come naturally with this central control plane. I am not sure whether K8s/Knative/KServe can accomplish this... From what I know, they can only serve a single model? Certainly KServe handles autoscaling/load balancing well. The main reason I brought up this topic is that I have seen many issues and PRs posted in the Fastchat community about supporting the latest vLLM features, with no response afterwards. That is why I want to ask the vLLM community for help in addressing this "Centralized Access" requirement in our daily usage...
@leiwen83 I agree with the other comments that this seems out of scope.

From an autoscaling point of view, the important thing for vLLM itself is to expose appropriate metrics that existing cluster management systems (such as Kubernetes) can monitor and use to make and execute scaling decisions. The best metric probably isn't queue length, since the nature of requests in the queue(s) could be very different, and when queues are empty you would still want to balance based on the current state of requests being processed in each server.

In terms of centralized access to multiple models, this would just be an independent routing layer that can forward to vLLM deployments as appropriate. Of course the API that is exposed/forwarded should match the vLLM (OpenAI) API or forward payloads unchanged. If there are API parameters not propagated by the routing layer then it should be updated accordingly.

I do think that specialized load balancing is important though, to take into account the same queue-like metrics mentioned above, and to be session and/or prefix aware to maximize prefix cache hit-rates (so that consecutive requests which have the same chat history will hit the same server). @simon-mo mentioned that he was looking into this. I'm not sure that Python would necessarily be the best choice for such a router though.
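To make the prefix-aware idea above concrete, here is a minimal sketch of routing by chat-history prefix, so that consecutive requests in the same conversation hit the same server. The backend names, the number of turns hashed, and the hashing scheme are all illustrative assumptions, not anything from the thread.

```python
# Hypothetical sketch: prefix-aware request routing. Requests sharing the
# same chat-history prefix are sent to the same backend, maximizing
# prefix-cache hit rates. Backend addresses are placeholders.
import hashlib

BACKENDS = ["vllm-0:8000", "vllm-1:8000", "vllm-2:8000"]

def pick_backend(messages, prefix_turns=2):
    """Hash the first few chat turns so follow-up requests in the same
    conversation deterministically land on the same vLLM server."""
    prefix = "".join(m["content"] for m in messages[:prefix_turns])
    digest = hashlib.sha256(prefix.encode()).digest()
    return BACKENDS[int.from_bytes(digest[:8], "big") % len(BACKENDS)]
```

A real router would also need to rebalance when backends join or leave (e.g. via consistent hashing) and fall back to load-based selection when no prefix matches.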
Motivation.
The Fastchat-vLLM operational model offers significant advantages when deploying large language models (LLMs) for production services. [1]
The controller architecture in Fastchat is particularly beneficial for LLM deployment, owing to its loosely coupled design with the vLLM backend. This allows for:
Autoscaling: The vLLM backend can join and exit the cluster freely, enabling dynamic scaling capabilities.
Rolling Updates: The introduction of new models with distinct names allows the cluster to gradually update models, a process known as rolling updates.
Centralized Access: Users are relieved of the burden of tracking different URLs or IPs for various models; they simply send requests to the controller, which dispatches each request to the appropriate backend based on the model name and ensures effective load balancing.
However, the challenge for Fastchat lies in managing multiple backends, including vLLM. This complexity appears to hinder its ability to keep pace with the rapid evolution of vLLM. It is disheartening to observe that Fastchat currently does not support the latest vLLM features, such as multi-LoRA, fragmented chat-stream support, and guided decoding, among others.
Reference:
[1] https://blog.vllm.ai/2023/06/20/vllm.html
Proposed Change.
Just as a heads-up: I have ported the key controller feature from Fastchat and reduced it to a minimal shape. For interfaces like /v1/../completions, it simply extracts the model name and forwards everything else to the backend, so that all vLLM features remain usable.
Current implementation: #4861
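The dispatch logic described above (extract the model name, forward the payload unchanged) can be sketched roughly as follows. This is not the code from #4861; the class and method names are hypothetical, and backend registration/round-robin selection are assumptions about how such a minimal controller might work.

```python
# Hypothetical sketch of a minimal controller: backends register under a
# model name, and each /v1/*/completions request is routed to a backend
# serving the requested model; the payload itself would be proxied unchanged.
import itertools

class Controller:
    def __init__(self):
        self._backends = {}   # model name -> list of backend URLs
        self._cursors = {}    # model name -> round-robin iterator

    def register(self, model, url):
        """A vLLM backend announces that it serves `model` at `url`."""
        self._backends.setdefault(model, []).append(url)
        self._cursors[model] = itertools.cycle(self._backends[model])

    def route(self, request_json):
        """Extract the model name and pick a backend round-robin; the
        request body would then be forwarded to the chosen URL as-is."""
        model = request_json["model"]
        if model not in self._backends:
            raise KeyError(f"no backend serves {model!r}")
        return next(self._cursors[model])
```

Because the controller only inspects the model field and forwards the rest verbatim, new vLLM request parameters work without any controller changes, which is the main point of keeping it minimal.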
Future directions:
Feedback Period.
No response
CC List.
@simon-mo @robertgshaw2-neuralmagic
Any Other Things.
No response