So I just noticed this very strange behaviour that perhaps has no severe implications but is nevertheless interesting to explore:
I am hosting a Mixtral model on a server with 2 A100s. My product comprises 3 LLM calls, 2 of them using 2 adapters (1 for each) and the last one using the base model. After a new LoRAX release, I download the latest image and run it.
After the server is all warmed up, I usually run a small validation script to ensure that all requests are successfully served.
I noticed that the server would initially be very slow in swapping the adapters, regardless of the arguments passed to the docker run command. Eventually, though, we would reach almost instantaneous swapping, so I didn't pay too much attention until today, when I noticed the following:
If only sequential calls are sent to the server (whether using the adapters or not), the whole pipeline takes ~15 sec.
If I use a script to parallelize them (say via concurrent.futures), they naturally finish faster.
But the speed improvement from this parallel execution somehow persists even after I go back to my sequential script: the same sequential operation that previously took ~15 sec now finishes in under 3 sec!
It appears that the parallel execution of different adapters causes them to be loaded into the cache, which had not been happening before, despite the relevant parameters being set in the docker config.
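For context, the experiment boils down to firing the same three calls sequentially versus concurrently. A minimal sketch of the parallel variant, assuming a LoRAX server at `http://localhost:8080` (the URL, adapter ids, and the `generate` helper are placeholders, not the author's actual script):

```python
from concurrent.futures import ThreadPoolExecutor

# Two LoRA adapters plus the base model (None), mirroring the 3-call pipeline.
ADAPTERS = ["adapter-a", "adapter-b", None]

def generate(adapter_id, prompt="Hello"):
    # In the real script this would POST to the server's generate endpoint,
    # passing adapter_id in the request parameters. A stand-in result is
    # returned here so the sketch runs without a live server.
    return {"adapter_id": adapter_id, "prompt": prompt}

# Sequential: one call at a time, so adapters may swap in and out between calls.
sequential = [generate(a) for a in ADAPTERS]

# Parallel: all adapters are requested at once, which (per the observation
# above) appears to leave them resident in the cache afterwards.
with ThreadPoolExecutor(max_workers=len(ADAPTERS)) as pool:
    parallel = list(pool.map(generate, ADAPTERS))

print(len(parallel))  # 3
```

`ThreadPoolExecutor.map` preserves input order, so the parallel results line up with the sequential ones call-for-call.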
Curious if anyone else has encountered this. In any case, I am still in awe of the speed at which new features and improvements are being integrated, and unsure whether this is expected, so I thought I'd flag it. 💪
Information
Docker
The CLI directly
Tasks
An officially supported command
My own modifications
Reproduction
Expected behavior
lighteternal changed the title from "LoRAX server with 2 GPUs and multiple adapters becomes faster in swapping ONLY after parallel execution of requests." to "LoRAX server with 2 GPUs and multiple adapters becomes permanently faster in swapping ONLY after parallel execution of requests." on Apr 8, 2024
I face the same issue while running 1 model with 2 different adapters. As I send sequential requests, the 1st adapter loads and stays attached to the model; if the next request sends a different adapter_id, the 1st adapter is offloaded and the 2nd one loads, and this keeps happening.
However, when I use concurrent.futures to send parallel requests, the 1st request fails with the error
max_rank = max(adapter_weights[idx].lora_a_r for idx in segment_indices if idx in adapter_weights)
tgi-service-1 | ValueError: max() arg is an empty sequence
and the later requests succeed. With parallel requests, the adapters are kept in the cache and are not loaded and offloaded over and over. I'm not sure whether the 2 adapters are being merged or simply cached, since LoRAX has an option to merge 2 adapters.
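The load/offload pattern described above is what you'd expect from a size-limited LRU cache of adapters. A toy illustration (not LoRAX's actual code) of why alternating sequential requests thrash such a cache, while admitting both adapters together eliminates the repeated loads:

```python
from collections import OrderedDict

class AdapterCache:
    """Toy LRU cache standing in for GPU-resident adapter weights."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()
        self.loads = 0  # counts expensive load-from-disk events

    def get(self, adapter_id):
        if adapter_id not in self.cache:
            self.loads += 1
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
            self.cache[adapter_id] = object()  # stand-in for the weights
        self.cache.move_to_end(adapter_id)
        return self.cache[adapter_id]

# Capacity 1: alternating sequential requests reload an adapter every time.
c1 = AdapterCache(capacity=1)
for a in ["a", "b", "a", "b"]:
    c1.get(a)
print(c1.loads)  # 4

# Capacity 2 (both adapters admitted together): only the first two loads.
c2 = AdapterCache(capacity=2)
for a in ["a", "b", "a", "b"]:
    c2.get(a)
print(c2.loads)  # 2
```

Under this model, whatever effectively raises the number of co-resident adapters (as the parallel burst seems to do) turns every subsequent swap into a cache hit, matching the persistent speedup reported above.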