Integrating ReFT #1657

Open · raven38 wants to merge 96 commits into main

Conversation

@raven38 commented Apr 16, 2024

Paper link: https://arxiv.org/abs/2404.03592

This PR integrates ReFT by creating a new tuner model type. Please see #1654.

Changes

I defined a new class ReFTModel as a subclass of LycorisTuner.
src/peft/tuners/reft/layer.py has the actual operations for ReFT methods.
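
For orientation, a hypothetical usage sketch (the ReftConfig name, its import path, and the chosen fields are assumptions based on the files added here, not the final API):

from transformers import AutoModelForCausalLM
from peft import get_peft_model
from peft.tuners.reft import ReftConfig  # assumed import path, per src/peft/tuners/reft/config.py

base_model = AutoModelForCausalLM.from_pretrained("gpt2")
# r and alpha appear in the new config; target_modules follows the usual PEFT convention.
config = ReftConfig(r=8, alpha=8, target_modules=["c_attn"])
model = get_peft_model(base_model, config)
model.print_trainable_parameters()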

Status

  • Discuss specification
  • Refactor implementation
  • Test implementation
  • Add documentation for methods.

@BenjaminBossan (Member) left a comment

Thanks a lot for providing this PR to add REFT to PEFT. This is already looking really good, well done!

I added a few comments, please take a look.

General comments

  • Please run make style on your code.
  • We'll have to add some tests. Would you mind looking at the tests that other methods have and adding the ReFT tests in a similar manner? E.g., check out the tests added by Add LayerNorm tuning model #1301.
  • Ideally, we'd have some docs and examples before merging this.
  • In case you have copied code from pyreft/pyvene, please add comments with references.
  • IIUC, the implemented method is specifically LoReFT. This should be mentioned. We may also want to rename things to LoReFT, unless you think this could be extended in the future to include more ReFT methods.
  • Speaking of names, in the descriptions you write "REFT" but the authors spell it "ReFT"; let's adjust this. When it comes to variable names, there is no need to change those (so ReftLayer etc. can stay as is).

Hyper params

The authors describe 4 important hyper-parameters in section 4.1:

  1. number of prefix positions
  2. number of suffix positions
  3. set of layers to target
  4. whether to tie intervention parameters

IIUC, 3 exists via target_modules. What about 1, 2, and 4?
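
For concreteness, a hypothetical sketch of how these four knobs could surface in a config (all field names are invented for illustration, not the PR's API):

from dataclasses import dataclass, field
from typing import List

@dataclass
class LoReftHyperParamsSketch:
    first_n: int = 0                  # 1. number of prefix positions to intervene on
    last_n: int = 0                   # 2. number of suffix positions to intervene on
    target_modules: List[str] = field(default_factory=lambda: ["o_proj"])  # 3. modules/layers to target
    tied_weights: bool = True         # 4. whether to tie intervention parameters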

Comment on lines 54 to 55
alpha: int = field(default=8, metadata={"help": "Reft alpha"})
loc: Optional[Union[List[int], int]] = field(default=None, metadata={"help": "token locations are applied ReFT"})

Member:
These two don't have a docstring entry. Also, the description "token locations are applied ReFT" needs to be explained better.


def create_adapter_parameters(self, adapter_name: str, r: int, shape: Tuple[int, ...]):
rotate_layer = LowRankRotateLayer(self.out_features, r)
self.reft_R[adapter_name] = torch.nn.utils.parametrizations.orthogonal(rotate_layer, orthogonal_map='cayley')

Member:
Is this redundant with torch.nn.init.orthogonal_ above?

Author:

torch.nn.init.orthogonal_ makes the tensor orthogonal only once, at initialization, while torch.nn.utils.parametrizations.orthogonal keeps the tensor orthogonal during training.
I checked the implementation of torch.nn.utils.parametrize.register_parametrization. It calls the parametrization (the orthogonal map) during the registration/initialization process, so the separate torch.nn.init.orthogonal_ call is redundant.
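
A quick standalone check of that point (a sketch, not the PR code): the parametrization recomputes an orthogonal weight on every access, so the property holds throughout training and a one-off init call adds nothing.

import torch
import torch.nn as nn

layer = nn.Linear(8, 64, bias=False)  # weight shape (64, 8): tall, like a low-rank rotation
layer = nn.utils.parametrizations.orthogonal(layer, orthogonal_map="cayley")
w = layer.weight
print(torch.allclose(w.T @ w, torch.eye(8), atol=1e-5))  # True: columns are orthonormal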

Member:
So IIUC, we can remove one of the invocations?

Author:
Yes, we can remove one of them.

self._active_adapter = adapter_name
self.update_layer(adapter_name, r, alpha, rank_dropout, module_dropout, init_weights, **kwargs)

def _get_delta_activations(

Member:
Is this used anywhere?

Author:
This is not used. Since LycorisTuner defines _get_delta_activations as an abstract method, I defined it here only to avoid raising errors.

Member:
I see. In that case, it should probably not be an abstract method on LycorisLayer to begin with, but let's not bother with this here. I'd just raise a NotImplementedError then, instead of having dead code.
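
A sketch of that suggestion (not the PR's actual code):

def _get_delta_activations(self, adapter_name, input, *args, **kwargs):
    # LoReFT edits hidden states directly rather than producing a weight delta,
    # so this LyCORIS hook is intentionally unsupported here.
    raise NotImplementedError("_get_delta_activations is not used by LoReFT layers.")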

adapter_name, r, alpha, rank_dropout, module_dropout, init_weights, use_effective_conv2d, **kwargs
)

def _get_delta_activations(

Member:
Is this used anywhere?

Author:
This is not used. Since LycorisTuner defines _get_delta_activations as an abstract method, I defined it here only to avoid raising errors.

raven38 and others added 5 commits April 18, 2024 15:16

@BenjaminBossan (Member):

@raven38 We just merged VeRA, which resulted in a bunch of merge conflicts. They should be very straightforward to resolve, let me know if you have any questions. Also ping me once this is ready for the next review.

@raven38 (Author) commented May 21, 2024

I think the code is ready for review.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@BenjaminBossan (Member):

@raven38 Thanks for the updates, I'm currently reviewing the PR, but it'll take some time. Meanwhile, could you please run make style on your code so that the CI passes?

@raven38 (Author) commented May 27, 2024

I modified the code so that make style passes, but make style still raises the following error. Should it be fixed in this PR?

examples/boft_controlnet/test_controlnet.py:40:1: E402 Module level import not at top of file
Found 1 error.
make: *** [style] Error 1

@BenjaminBossan (Member):

Hmm, code quality checks still fail:

ruff src tests examples docs scripts docker
ruff format --check src tests examples docs scripts docker
Would reformat: src/peft/tuners/reft/config.py
Would reformat: src/peft/tuners/reft/layer.py
Would reformat: src/peft/tuners/reft/model.py
Would reformat: tests/test_custom_models.py
Would reformat: tests/test_stablediffusion.py
Would reformat: tests/testing_common.py
6 files would be reformatted, 159 files already formatted

This does not appear to be related to anything in test_controlnet, which is independent of your PR. Is it possible that you use a different ruff version? The CI uses v0.2.2.

@BenjaminBossan (Member) left a comment

Thanks for all the updates. This looks very promising, and I don't think we need any major changes.

Before merging, we should add examples and docs, but we can tackle this after we have streamlined the implementation.

Regarding the intervention locations, your implementation departs from the pyreft implementation. Here, it is basically a hyper-parameter that is passed to the config. That means that for a given model, these values are fixed, right? In pyreft, it is part of the dataset, so it can vary from sample to sample.

I wonder if we should not stick with the pyreft implementation. This means that we can remove this parameter completely from PEFT. Instead, we can check the arguments passed to forward. IIUC, a pyreft dataset would contain an additional key, intervention_locations. If this is passed, we could use it, otherwise intervene on all positions. This would also mean that the PEFT implementation could use the datasets from pyreft, right? Maybe I'm missing something and there is a disadvantage in my suggestion, please LMK.
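
A minimal sketch of that idea (hypothetical helper and assumed tensor shapes, not the PR's or pyreft's actual code): the layer's forward would look for pyreft-style intervention_locations in its kwargs and fall back to intervening on all positions.

import torch

def select_intervention_positions(hidden_states: torch.Tensor, **kwargs) -> torch.Tensor:
    # hidden_states: (batch, seq_len, hidden_dim)
    locations = kwargs.get("intervention_locations")
    if locations is None:
        return hidden_states  # no per-sample locations passed: intervene everywhere
    # locations: (batch, num_positions) token indices (shape assumed for illustration)
    index = locations.unsqueeze(-1).expand(-1, -1, hidden_states.shape[-1])
    return torch.gather(hidden_states, 1, index)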

Moreover, in the paper, the authors discuss a variant with tied weights. AFAICT, there is currently no option to configure tied vs untied. Would it be possible to implement this?

@@ -0,0 +1,20 @@
# Copyright 2023-present the HuggingFace Inc. team.

Member:
Suggested change
# Copyright 2023-present the HuggingFace Inc. team.
# Copyright 2024-present the HuggingFace Inc. team.

Same for the other newly created modules.



def parse_positions(positions: str):
# Code borrow from https://github.com/stanfordnlp/pyreft/pyreft/dataset.py

Member:
No shame in saying this is "copied", unless you plan to give the code back ;-)

Also, let's add a perma-link: https://github.com/stanfordnlp/pyreft/blob/432dc64d01ff80002c6e83bfdaeb55710bf6a80a/pyreft/dataset.py#L53
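
For context, a sketch of what such a parser does, assuming pyreft-style position strings like "f5+l5" (first 5 and last 5 prompt tokens); see the perma-link above for the original:

def parse_positions(positions: str) -> tuple[int, int]:
    # Illustration only; the pyreft original is linked above.
    first_n, last_n = 0, 0
    for part in positions.lower().split("+"):
        if part.startswith("f"):
            first_n = int(part[1:])
        elif part.startswith("l"):
            last_n = int(part[1:])
    return first_n, last_n

print(parse_positions("f5+l5"))  # (5, 5)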

dropout: float,
**kwargs,
) -> None:
"""Internal function to create loha adapter

Member:
Suggested change
"""Internal function to create loha adapter
"""Internal function to create loreft adapter

Some arguments are not described below, most notably loc. Could you please update the docstring?

self.reft_A[adapter_name] = torch.nn.Linear(self.out_features, r)

def merge(self, safe_merge: bool = False, adapter_names: Optional[list[str]] = None) -> None:
pass

Member:
So merging is not supported, right? If we just pass when the user calls model.merge(), they might get the impression that it worked. Instead, we should raise a ValueError that merging is not possible with LoReFT. Same for unmerge.
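
A sketch of that suggestion (not the PR's actual code):

from typing import Optional

def merge(self, safe_merge: bool = False, adapter_names: Optional[list[str]] = None) -> None:
    # LoReFT intervenes on hidden states, so there is nothing to fold into the
    # base weights; fail loudly instead of silently doing nothing.
    raise ValueError("Merging is not supported for LoReFT layers.")

def unmerge(self) -> None:
    raise ValueError("Unmerging is not supported for LoReFT layers.")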

self.last_n[adapter_name] = last_n
self.alpha[adapter_name] = alpha
self.scaling[adapter_name] = alpha / r
self.dropout[adapter_name] = torch.nn.Dropout(dropout)

Member:
Let's use nn.Identity if dropout is 0.
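
A sketch of the suggested change (illustrative helper, not the PR's code):

import torch.nn as nn

def build_dropout(p: float) -> nn.Module:
    # Use a no-op Identity when the rate is zero instead of nn.Dropout(0.0).
    return nn.Dropout(p) if p > 0.0 else nn.Identity()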

loc = torch.cat([torch.arange(first_n), torch.arange(result.shape[1]-last_n, result.shape[1])])
selected_results = torch.gather(result, 1, loc)
rotated_base = rotate_layer(selected_results)
offset = torch.matmul((learned_source(selected_results) - rotated_base), rotate_layer.weight)

Member:
Same argument as above regarding forward.

r (`int`): LoReFT rank.
alpha (`int`): LoReFT alpha.
loc (`Optional[str]`):
Token locations are applied LoReFT.

Member:
This needs much better explanation, ideally with a few examples. Also, it should be mentioned that this only really makes sense for language models, right?


def __repr__(self) -> str:
rep = super().__repr__()
return "reft." + rep

Member:
Suggested change
return "reft." + rep
return "loreft." + rep

In case we add more ReFT layers in the future. This is not consistent with the module name, but I think that's not crucial.


class LoReftModel(LycorisTuner):
"""
LoReFT model from a pretrained model.

Member:
Let's add links to the paper and the reference repo.

default=False,
metadata={"help": "Whether to share the ReFT parameters between blocks or not."},
)


Member:
So right now, LoReFT has no init_weights argument, unlike almost all other adapters. I think it would be useful to have this. I think you mentioned somewhere that by default, LoReFT is not initialized to be the identity transform. In that case, we could set the default value to False (which in PEFT means not identity), but it would still be nice for users to have the option to initialize as identity by passing init_weights=True.
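
A sketch of how that option might look in the config (the default and help text here are illustrative, not the PR's code):

from dataclasses import dataclass, field

@dataclass
class LoReftConfigSketch:
    init_weights: bool = field(
        default=False,  # False: keep the paper's default (non-identity) initialization
        metadata={"help": "Whether to initialize the LoReFT intervention as the identity transform."},
    )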

@frankaging commented May 30, 2024

Hey! Thanks again for the PR (I am one of the authors of the ReFT paper). I skimmed through the code changes, and they look very promising.

Two quick questions:

  1. With the current PEFT config, could we target any residual stream with LoReFT? If so, what would the config look like?
  2. If I add a LoReFT adapter and we only intervene on the prompt tokens, the adapter will not have an effect during the decoding process, right?

Thanks!

@BenjaminBossan (Member):

@frankaging Thanks a lot for taking a look. Regarding your questions, I'll let @raven38 provide the final answer. Personally, I think 2 is right and I don't quite understand 1, maybe you can give an example?

@frankaging:

> @frankaging Thanks a lot for taking a look. Regarding your questions, I'll let @raven38 provide the final answer. Personally, I think 2 is right and I don't quite understand 1, maybe you can give an example?

Thanks for the update! For (1), I was wondering: for example, when targeting the attention output matrix with LoRA on a Llama model, we might have a config like this:

from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    r=4, lora_alpha=32, target_modules=["o_proj"], layers_to_transform=[15],
    use_rslora=True, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM"
)
model = get_peft_model(model, peft_config)

where target_modules=["o_proj"]. Since for LoReFT, we want to target the residual stream (transformer block/layer output), what should we put for the target_modules in that case? Thanks!

@BenjaminBossan (Member):

> Since for LoReFT, we want to target the residual stream (transformer block/layer output), what should we put for the target_modules in that case? Thanks!

I see, I think I get what you mean. Good question. I think as is, it would not work. We would have to target the attention module itself (the type depending on what kind of attention is being used), targeting a Linear layer would not work. I think this is what you're getting at, right?

If you have some pointers on how this is implemented in pyreft/pyvene, please share.

@frankaging:

> Since for LoReFT, we want to target the residual stream (transformer block/layer output), what should we put for the target_modules in that case? Thanks!
>
> I see, I think I get what you mean. Good question. I think as is, it would not work. We would have to target the attention module itself (the type depending on what kind of attention is being used), targeting a Linear layer would not work. I think this is what you're getting at, right?
>
> If you have some pointers on how this is implemented in pyreft/pyvene, please share.

Thanks for your reply. Yes, we need to target the whole block module itself (e.g., model.layers[15]). Will this be doable? If not, I think it is still valuable to have the current form where we target submodules (e.g., attention output or MLP output).

With pyreft or pyvene, we use register_module_forward_hook to place a callback to the existing computation graph, instead of overwriting current modules with the additional adapter modules. This is quite different, so it might not be that helpful here.
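
For illustration, a generic sketch of that hook-based pattern (toy module standing in for a decoder block, not pyvene's actual code):

import torch
import torch.nn as nn

block = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)

def intervention_hook(module, inputs, output):
    # A LoReFT-style intervention would edit selected token positions of
    # `output` here; returning a tensor overrides the block's output.
    return output  # placeholder: pass hidden states through unchanged

handle = block.register_forward_hook(intervention_hook)
hidden = block(torch.randn(2, 10, 64))  # the hook fires on this forward pass
handle.remove()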
