Integrating ReFT #1657
base: main
Conversation
Thanks a lot for providing this PR to add REFT to PEFT. This is already looking really good, well done!
I added a few comments, please take a look.
General comments
- Please run `make style` on your code.
- We'll have to add some tests. Would you mind looking at the tests that other methods have and adding the ReFT tests in a similar manner? E.g., check out the tests added by Add LayerNorm tuning model #1301.
- Ideally, we have some docs and examples before merging this.
- In case you have copied code from pyreft/pyvene, please add comments with references.
- IIUC, the implemented method is specifically LoReFT. This should be mentioned. We may also want to rename the names to `LoReFT`, except if you think this could be extended in the future to include more ReFT methods.
- Speaking of names, in the descriptions you write "REFT" but the authors spell it as "ReFT", let's adjust this. When it comes to variable names, there is no need to change those (so `ReftLayer` etc. can stay as is).
Hyper params
The authors describe 4 important hyper-parameters in section 4.1:
1. number of prefix positions
2. number of suffix positions
3. set of layers to target
4. whether to tie intervention parameters
IIUC, 3 exists via `target_modules`. What about 1, 2, and 4?
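To make the question concrete, the missing knobs could in principle be exposed along these lines — purely a hypothetical sketch with made-up field names, not a proposal for the final API:

```python
# Hypothetical sketch of how 1, 2, and 4 could surface in the config; field names are invented.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReftConfigSketch:
    # (1) / (2): number of prefix / suffix token positions to intervene on
    first_n: Optional[int] = field(default=None, metadata={"help": "Number of prefix positions to intervene on."})
    last_n: Optional[int] = field(default=None, metadata={"help": "Number of suffix positions to intervene on."})
    # (3): already covered by target_modules
    target_modules: Optional[List[str]] = field(default=None, metadata={"help": "Modules to apply the intervention to."})
    # (4): whether to tie (share) intervention parameters across the targeted layers
    tied_weights: bool = field(default=False, metadata={"help": "Whether to tie intervention parameters across layers."})
```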
src/peft/tuners/reft/config.py (Outdated)
alpha: int = field(default=8, metadata={"help": "Reft alpha"})
loc: Optional[Union[List[int], int]] = field(default=None, metadata={"help": "token locations are applied ReFT"})
These two don't have a docstring entry. Also, the description "token locations are applied ReFT" needs to be explained better.
    def create_adapter_parameters(self, adapter_name: str, r: int, shape: Tuple[int, ...]):
        rotate_layer = LowRankRotateLayer(self.out_features, r)
        self.reft_R[adapter_name] = torch.nn.utils.parametrizations.orthogonal(rotate_layer, orthogonal_map='cayley')
Is this redundant with `torch.nn.init.orthogonal_` above?
`torch.nn.init.orthogonal_` makes the tensor orthogonal once, at initialization, while `torch.nn.utils.parametrizations.orthogonal` keeps the tensor orthogonal during training. I checked the implementation of `torch.nn.utils.register_parametrization`: it calls the parametrization function (`orthogonal_`) in the initialization process. Thus, `torch.nn.init.orthogonal_` is redundant.
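As a quick standalone check (not the PR's code), the parametrization alone already yields an orthogonal weight:

```python
# Registering the orthogonal parametrization makes the weight orthogonal on creation
# and keeps it so after every optimizer step, so an extra init.orthogonal_ adds nothing.
import torch
from torch import nn
from torch.nn.utils import parametrizations

layer = nn.Linear(16, 4, bias=False)                       # toy stand-in for LowRankRotateLayer
layer = parametrizations.orthogonal(layer, orthogonal_map="cayley")

w = layer.weight                                           # recomputed through the parametrization on access
print(torch.allclose(w @ w.T, torch.eye(4), atol=1e-5))    # True, without calling torch.nn.init.orthogonal_
```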
So IIUC, we can remove one of the invocations?
Yes, we can remove one of them.
        self._active_adapter = adapter_name
        self.update_layer(adapter_name, r, alpha, rank_dropout, module_dropout, init_weights, **kwargs)

    def _get_delta_activations(
Is this used anywhere?
This is not used. Since LycorisTuner defines `_get_delta_activations` as `abstractmethod`, I define `_get_delta_activations` to avoid raising errors.
I see. In that case, it should probably not be an abstract method on `LycorisLayer` to begin with, but let's not bother with this here. I'd just raise a `NotImplementedError` then, instead of having dead code.
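I.e., something like this minimal stub (the signature is loosely copied from the other Lycoris layers, so it may not match the PR exactly):

```python
def _get_delta_activations(self, adapter_name: str, input: torch.Tensor, *args, **kwargs) -> torch.Tensor:
    # LoReFT intervenes on hidden states directly, so there is no weight delta to compute here.
    raise NotImplementedError("_get_delta_activations is not implemented for LoReFT layers.")
```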
src/peft/tuners/reft/layer.py (Outdated)
            adapter_name, r, alpha, rank_dropout, module_dropout, init_weights, use_effective_conv2d, **kwargs
        )

    def _get_delta_activations(
Is this used anywhere?
This is not used. Since LycorisTuner defines `_get_delta_activations` as `abstractmethod`, I define `_get_delta_activations` to avoid raising errors.
@raven38 We just merged VeRA, which resulted in a bunch of merge conflicts. They should be very straightforward to resolve, let me know if you have any questions. Also ping me once this is ready for the next review.
I think the code is ready for review.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@raven38 Thanks for the updates, I'm currently reviewing the PR, but it'll take some time. Meanwhile, could you please run `make style`?
I modified the code so that `make style` passes, but `make style` still raises the following error. Should it be fixed in this PR?
Hmm, code quality checks still fail:
This does not appear to be related to anything in
Thanks for all the updates. This looks very promising, and I don't think we need any major changes.
Before merging, we should add examples and docs, but we can tackle this after we have streamlined the implementation.
Regarding the intervention locations, your implementation departs from the pyreft implementation. Here, it is basically a hyper-parameter that is passed to the config. That means that for a given model, these values are fixed, right? In pyreft, it is part of the dataset, so it can vary from sample to sample.
I wonder if we should not stick with the pyreft implementation. This means that we can remove this parameter completely from PEFT. Instead, we can check the arguments passed to `forward`. IIUC, a pyreft dataset would contain an additional key, `intervention_locations`. If this is passed, we could use it, otherwise intervene on all positions. This would also mean that the PEFT implementation could use the datasets from pyreft, right? Maybe I'm missing something and there is a disadvantage in my suggestion, please LMK.
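Roughly, I'm imagining something like the sketch below — the helper names (`get_base_layer`, `_apply_intervention`) are placeholders, not the PR's actual code, and `intervention_locations` is the pyreft dataset key:

```python
from typing import Optional

import torch

def forward(self, x: torch.Tensor, *args, intervention_locations: Optional[torch.Tensor] = None, **kwargs):
    hidden = self.get_base_layer()(x, *args, **kwargs)
    if intervention_locations is None:
        # No locations supplied by the batch: intervene on every position.
        return hidden + self._apply_intervention(hidden)
    # Per-sample positions from the dataset, shape (batch, n_positions).
    idx = intervention_locations.unsqueeze(-1).expand(-1, -1, hidden.size(-1))
    selected = torch.gather(hidden, 1, idx)
    updated = selected + self._apply_intervention(selected)
    return torch.scatter(hidden, 1, idx, updated)
```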
Moreover, in the paper, the authors discuss a variant with tied weights. AFAICT, there is currently no option to configure tied vs untied. Would it be possible to implement this?
@@ -0,0 +1,20 @@
# Copyright 2023-present the HuggingFace Inc. team.
Suggested change:
# Copyright 2023-present the HuggingFace Inc. team.
# Copyright 2024-present the HuggingFace Inc. team.
Same for the other newly created modules.
def parse_positions(positions: str):
    # Code borrow from https://github.com/stanfordnlp/pyreft/pyreft/dataset.py
No shame in saying this is "copied", unless you plan to give the code back ;-)
Also, let's add a perma-link: https://github.com/stanfordnlp/pyreft/blob/432dc64d01ff80002c6e83bfdaeb55710bf6a80a/pyreft/dataset.py#L53
        dropout: float,
        **kwargs,
    ) -> None:
        """Internal function to create loha adapter
"""Internal function to create loha adapter | |
"""Internal function to create loreft adapter |
Some arguments are not described below, most notably `loc`. Could you please update the docstring?
        self.reft_A[adapter_name] = torch.nn.Linear(self.out_features, r)

    def merge(self, safe_merge: bool = False, adapter_names: Optional[list[str]] = None) -> None:
        pass
So merging is not supported, right? If we just pass when the user calls `model.merge()`, they might get the impression that it worked. Instead, we should raise a `ValueError` that merging is not possible with LoReFT. Same for `unmerge`.
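For instance, a small sketch of what those overrides could look like (the exact error message is just illustrative):

```python
from typing import Optional

def merge(self, safe_merge: bool = False, adapter_names: Optional[list[str]] = None) -> None:
    # Merging the intervention into the base weights is not possible for LoReFT.
    raise ValueError("Merging is not supported for LoReFT layers.")

def unmerge(self) -> None:
    raise ValueError("Unmerging is not supported for LoReFT layers.")
```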
        self.last_n[adapter_name] = last_n
        self.alpha[adapter_name] = alpha
        self.scaling[adapter_name] = alpha / r
        self.dropout[adapter_name] = torch.nn.Dropout(dropout)
Let's use `nn.Identity` if dropout is 0.
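I.e., roughly:

```python
# Skip the Dropout module entirely when dropout is disabled.
self.dropout[adapter_name] = torch.nn.Dropout(p=dropout) if dropout > 0.0 else torch.nn.Identity()
```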
        loc = torch.cat([torch.arange(first_n), torch.arange(result.shape[1]-last_n, result.shape[1])])
        selected_results = torch.gather(result, 1, loc)
        rotated_base = rotate_layer(selected_results)
        offset = torch.matmul((learned_source(selected_results) - rotated_base), rotate_layer.weight)
Same argument as above regarding `forward`.
        r (`int`): LoReFT rank.
        alpha (`int`): LoReFT alpha.
        loc (`Optional[str]`):
            Token locations are applied LoReFT.
This needs much better explanation, ideally with a few examples. Also, it should be mentioned that this only really makes sense for language models, right?
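As an illustration, the entry could read along these lines — only a suggestion, and the `"f<n>+l<n>"` position-string format is taken from pyreft's `parse_positions`:

```python
# Possible docstring wording (illustrative only):
#
# loc (`Optional[str]`):
#     Positions of the tokens to intervene on, using the pyreft-style position string.
#     For example, "f5+l5" applies the intervention to the first 5 and last 5 tokens of
#     each sequence, and "f1" only to the first token. If not set, all positions are
#     intervened on. This only makes sense for language models with a sequence dimension.
```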
    def __repr__(self) -> str:
        rep = super().__repr__()
        return "reft." + rep
return "reft." + rep | |
return "loreft." + rep |
In case we add more ReFT layers in the future. This is not consistent with the module name, but I think that's not crucial.
class LoReftModel(LycorisTuner):
    """
    LoReFT model from a pretrained model.
Let's add links to the paper and the reference repo.
        default=False,
        metadata={"help": "Whether to share the ReFT parameters between blocks or not."},
    )
So right now, LoReFT has no `init_weights` argument, unlike almost all other adapters. I think it would be useful to have this. I think you mentioned somewhere that by default, LoReFT is not initialized to be the identity transform. In that case, we could set the default value to `False` (which in PEFT means not identity), but it would still be nice for users to have the option to initialize as identity by passing `init_weights=True`.
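As a rough, hypothetical sketch of the extra config field (the help text is just a suggestion, mirroring how other PEFT configs expose `init_weights`):

```python
# Sketch of an additional field in the LoReFT config dataclass; not the PR's actual API.
init_weights: bool = field(
    default=False,
    metadata={
        "help": (
            "Whether to initialize the LoReFT parameters so that the intervention starts out as the "
            "identity transform. Defaults to False, matching the reference implementation, which does "
            "not start from the identity."
        )
    },
)
```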
hey! thanks again for the PR (i am one of the authors of the ReFT paper). i skimmed through the code change, it looks very promising. two quick questions:
thanks!
@frankaging Thanks a lot for taking a look. Regarding your questions, I'll let @raven38 provide the final answer. Personally, I think 2 is right and I don't quite understand 1, maybe you can give an example?
Thanks for the update! For (1), i was wondering: for example, when targeting the attention output matrix for LoRA of a Llama model, we might have a config like this:

from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    r=4, lora_alpha=32, target_modules=["o_proj"], layers_to_transform=[15],
    use_rslora=True, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM"
)
model = get_peft_model(model, peft_config)

where
I see, I think I get what you mean. Good question. I think as is, it would not work. We would have to target the attention module itself (the type depending on what kind of attention is being used), targeting a
If you have some pointers on how this is implemented in pyreft/pyvene, please share.
Thanks for your reply. Yes, we need to target the whole block module itself (e.g., With
Paper link: https://arxiv.org/abs/2404.03592
This PR integrates ReFT by creating a new tuner model type. Please see #1654.
Changes
I defined a new class `ReFTModel` as a subclass of `LycorisTuner`. `src/peft/tuners/reft/layer.py` has the actual operations for ReFT methods.

Status