Learning the combination weights of pre-trained LoRA Modules #1655

Closed
mahdibeit opened this issue Apr 15, 2024 · 5 comments
mahdibeit commented Apr 15, 2024

Feature request

PEFT can combine pre-trained LoRA modules by averaging them or by supplying custom weights for a weighted average. The paper below shows that learning these combination weights outperforms naive averaging in few-shot adaptation settings.
WLoRA
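For context, here is a minimal sketch of how PEFT can already combine two pre-trained adapters with fixed weights today (the base model name and adapter paths are placeholders); the proposal is to learn those weights instead of fixing them:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load a base model and attach two pre-trained LoRA adapters (paths are placeholders).
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base, "path/to/lora_1", adapter_name="lora_1")
model.load_adapter("path/to/lora_2", adapter_name="lora_2")

# Weighted average with fixed, user-chosen weights; WLoRA would learn these instead.
model.add_weighted_adapter(
    adapters=["lora_1", "lora_2"],
    weights=[0.5, 0.5],
    adapter_name="combined",
    combination_type="linear",
)
model.set_adapter("combined")
```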

Motivation

Learning the combination weights lets users make use of pre-trained LoRA modules that are already available on the Hugging Face Hub. It is also very parameter-efficient, since only the combination weights are trained. More importantly, it can outperform training a LoRA from scratch in settings where the number of training samples is limited.

Your contribution

I can submit a PR. PEFT could then combine any set of pre-trained LoRA modules with an interface like the following:

```python
wlora_config = WLoraConfig(skilled_loras=[PATH_TO_UPSTREAM_1, PATH_TO_UPSTREAM_2])
model = get_peft_model(llama2, wlora_config)
```

BenjaminBossan (Member) commented

Hi, thank you for proposing to add this method.

I only skimmed the paper, but IIUC we assume that the user has a couple of already trained LoRA adapters and now wants to combine them for a new task. The idea is that learning the weights used for the weighted average (the weights argument of add_weighted_adapter) can lead to better results than naive uniform weights. (Note that we offer many combination types, not just averaging; maybe that's worth looking into for the paper.)

To learn these weights, I assume we have to load all the LoRA adapters at training time, freeze their weights, then add an extra scaling factor to this line, is that right?

I haven't thought through the overall design of this, but I think it should be possible to add this to the existing LoRA code without too many additions. Feel free to open a draft PR where we can discuss the design.

mahdibeit (Author) commented

Hi @BenjaminBossan, thanks for taking the time to read the paper.

Thank you for your great suggestion. We will evaluate other combination methods in the paper.

To answer your question: yes, you are absolutely right. We can just use the existing LoRA code with a boolean flag like learn_combination_weights to configure the training process. Overall, I need to do the following:

  1. Instantiate a trainable tensor named combination_weights that learns the scaling factor for each pre-trained, frozen LoRA.
  2. During the forward pass, apply a softmax over combination_weights and multiply each LoRA's output by the corresponding entry of combination_weights in this line.

Throughout these steps, I have to make sure that the LoRA weights stay frozen and that the merge method works as intended (see the sketch below).
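As a rough standalone sketch of these two steps (hypothetical code, not the actual PEFT layer internals; the class name and constructor arguments are made up for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WeightedLoraLinear(nn.Module):
    """Toy layer: base linear plus a learned mixture of frozen LoRA adapters."""

    def __init__(self, base_layer, lora_As, lora_Bs, scalings):
        super().__init__()
        self.base_layer = base_layer  # assumed frozen elsewhere, as usual for LoRA
        # Pre-trained LoRA factors, kept frozen.
        self.lora_As = nn.ModuleList(lora_As)
        self.lora_Bs = nn.ModuleList(lora_Bs)
        for m in list(self.lora_As) + list(self.lora_Bs):
            m.requires_grad_(False)
        self.scalings = scalings
        # Step 1: a single trainable tensor, one entry per upstream LoRA.
        self.combination_weights = nn.Parameter(torch.zeros(len(lora_As)))

    def forward(self, x):
        result = self.base_layer(x)
        # Step 2: softmax over the combination weights, then scale each
        # frozen LoRA's contribution by its learned weight.
        weights = F.softmax(self.combination_weights, dim=0)
        for w, A, B, s in zip(weights, self.lora_As, self.lora_Bs, self.scalings):
            result = result + w * s * B(A(x))
        return result
```

In this sketch only combination_weights requires gradients, so training touches a single tensor whose size equals the number of upstream LoRAs.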


Alternatively, it is possible to create a new peft/tuner module named something like WLoRA and write the appropriate class there. This would allow users to simply run the following:

```python
wlora_config = WLoraConfig(upstream_loras=[PATH_TO_UPSTREAM_1, PATH_TO_UPSTREAM_2])
model = get_peft_model(llama2, wlora_config)
```

I personally prefer the second option, as it does not complicate the main LoRA module and is easier to use. However, I trust your judgment. Let me know which direction you prefer, and I can start implementing and open a draft PR.

BenjaminBossan (Member) commented

I think the second suggestion with a dedicated class is good; it can still reuse much of the existing code, though. Regardless, if you have some code to share, feel free to do so, as that makes the discussion much easier.

mahdibeit (Author) commented

Hi @BenjaminBossan, I just opened a draft PR using the first method. Let me know what you think. My main concern is this line, which freezes the pre-trained lora_A and lora_B.
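For illustration, one possible way to enforce that freezing (a sketch assuming PEFT's usual lora_A/lora_B parameter naming, not the exact line referenced in the PR):

```python
# Disable gradients for every pre-trained lora_A / lora_B parameter so that
# only the combination weights are trained.
for name, param in model.named_parameters():
    if "lora_A" in name or "lora_B" in name:
        param.requires_grad_(False)
```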


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
