
Feature/raft fine tuning #874

Open
wants to merge 57 commits into base: main

Conversation

efenocchi

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Please link to any issues this PR addresses.

Changelog

What are the changes made in this PR?
This PR introduces the ability to load a dataset from Activeloop by calling load_deep_lake_dataset instead of load_dataset (see torchtune/torchtune/datasets/_utils.py). Additionally, it enables fine-tuning of available models using the RAFT technique (see torchtune/recipes/configs/llama3/8B_lora_single_device_deep_lake_raft.yaml).
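For context, a minimal sketch of what a `load_deep_lake_dataset` helper along these lines might look like. This is not the PR's exact code; it assumes only that `deeplake.load()` accepts a `hub://org/name` path, and keeps the import inside the function so `torchtune.datasets` stays importable without the optional dependency:

```python
def load_deep_lake_dataset(deep_lake_dataset: str, **load_kwargs):
    """Load a dataset from Activeloop Deep Lake (e.g. a 'hub://org/name' path).

    The deeplake import lives inside the function so that importing the
    datasets module does not require the optional dependency.
    """
    try:
        import deeplake
    except ImportError as e:
        raise ImportError(
            "load_deep_lake_dataset requires the optional dependency "
            "'deeplake'; install it with `pip install deeplake`."
        ) from e
    return deeplake.load(deep_lake_dataset, **load_kwargs)
```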

Test plan

Please make sure to do each of the following if applicable to your PR. (If you're not sure about any one of these just ask and we will happily help.)

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
    • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

I also want to report an error in the main repository when running the integration tests in tests/recipes/test_eleuther_eval.py:
(screenshots attached: integration_test_error_1, integration_test_error_2)


pytorch-bot bot commented Apr 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/874

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 26, 2024
Contributor

@ebsmothers ebsmothers left a comment
Thanks for the PR! Appreciate you adding the unit tests as well. My main questions are around the addition of the new dependency and its handling on imports, need for a dataloader class, and the new configs.

Re the failing test: we do not currently have lm_eval in our optional dependencies due to some other transitive dependencies, so if you follow the suggestion in the console and `pip install "lm_eval==0.4.*"` I suspect you will be able to get that test to pass.

@@ -47,6 +47,7 @@ dev = [
"pytest-integration",
"tensorboard",
"wandb",
"deeplake"
Contributor

Does deeplake have any upstream dependencies? Wanna make sure we're aware of what we're pulling in here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm not mistaken, deeplake requires: aioboto3, boto3, click, humbug, libdeeplake, lz4, nest-asyncio, numpy, pathos, pillow, pydantic, pyjwt, and tqdm (tested with pip show deeplake; tell me if something is wrong).

@@ -0,0 +1,94 @@
# Config for single device LoRA finetuning in lora_finetune_single_device.py
Contributor

I don't think you need to create separate configs for a new dataset. In general we would recommend either CLI overrides or tune cp + modifying the config locally, rather than adding an entirely new file (we already have a bunch of configs as it is, and we're trying not to proliferate them any more than strictly necessary).

Author

I thought it would be helpful to have an example, because it's not just another dataset; it's another technique to fine-tune with. I thought this example would make the technique easier to use.

@@ -525,4 +525,4 @@ def recipe_main(cfg: DictConfig) -> None:


if __name__ == "__main__":
sys.exit(recipe_main())
sys.exit(recipe_main())
Contributor

I know you checked the box in the summary saying you ran pre-commit hooks, but please just double-check. The missing newline at the end of this file makes me a bit suspicious. You can re-run on all files by following these instructions.

Author

Checked again; I added a newline at the end of the file, and all tests passed again.

# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

import deeplake
Contributor

I think this will break if someone is importing from datasets without having installed dev dependencies

Author

How can we handle this part?

# """


class DeepLakeDataloader(Dataset):
Contributor

Sorry noob question, but why is just generic PyTorch dataloader not sufficient here? I don't see any custom sampling logic or anything like that, seems like the rest of this should be handled on the dataset side already?

Author

To access the dataset values we need a custom dataset class; otherwise we would not be able to get at the data without raising errors.
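To make the discussion concrete, here is a minimal sketch (an assumption about what DeepLakeDataloader roughly does, not the PR's code) of wrapping a loaded Deep Lake dataset so each row comes back as a plain dict. PyTorch's DataLoader only requires `__len__` and `__getitem__` on a map-style dataset, so no torch import is needed here; with a real Deep Lake dataset the column values would additionally need `.text()`/`.numpy()` conversion:

```python
class DeepLakeDataloader:
    """Duck-typed map-style dataset over a Deep Lake dataset.

    Assumes `ds` exposes `len(ds)`, a `tensors` collection of column names,
    and column access via `ds[name][idx]` (illustrative, not the exact API use).
    """

    def __init__(self, ds):
        self._ds = ds

    def __len__(self):
        return len(self._ds)

    def __getitem__(self, idx):
        # Return one row as a dict of column name -> value, so downstream
        # code can treat each sample like an ordinary mapping.
        return {name: self._ds[name][idx] for name in self._ds.tensors}
```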

@@ -87,6 +87,43 @@ def format(
return prompt


class RAFTInstructTemplate(InstructTemplate):
Contributor

What's the difference between this and the AlpacaInstructTemplate with no input above? It looks pretty much the same at first glance

Author
@efenocchi efenocchi May 4, 2024

Yes, if we assume we have no input, it's the same thing.

Author

I just refined the prompt to closely align with the structure outlined in the referenced paper
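To illustrate the kind of structure being discussed, here is a hypothetical RAFT-style prompt template (loosely following the paper's shape of context documents, then question, then a reasoned answer; the field names and wording are assumptions, not the PR's actual template):

```python
# Illustrative RAFT-style prompt: oracle + distractor documents as context,
# followed by the question, with the chain-of-thought answer as the target.
RAFT_TEMPLATE = (
    "Below is a question and a set of context documents. Some documents may "
    "be relevant and some may be distractors. Answer the question using the "
    "relevant context, reasoning step by step.\n\n"
    "### Context:\n{context}\n\n"
    "### Question:\n{question}\n\n"
    "### Answer:\n"
)


def format_raft_prompt(sample: dict) -> str:
    """Fill the template from a sample with 'context' and 'question' keys."""
    return RAFT_TEMPLATE.format(
        context=sample["context"], question=sample["question"]
    )
```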

from torchtune.modules.tokenizers import Tokenizer


class InstructDatasetDeepLakeRAFT(Dataset):
Contributor

High-level question on this class: is the only difference between this and InstructDataset the usage of load_deep_lake_dataset instead of load_dataset? If so, I wonder if this is something we should consider parametrizing rather than having to write an entirely new dataset class. cc @RdoubleA for any thoughts here

Author
@efenocchi efenocchi May 4, 2024

Yes, I think it can be parameterized to use load_deep_lake_dataset; can you give me some advice on how we can proceed here?
Also, in the _prepare_sample() class method we access a different input column name, "cot_answer".
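The parametrization idea could be sketched like this (hypothetical names; the point is that both a separate load function and the answer column become constructor arguments instead of a new class):

```python
from typing import Callable


class ParametrizedInstructDataset:
    """Instruct dataset taking the loader and answer column as parameters.

    `load_fn` stands in for either load_dataset or load_deep_lake_dataset,
    and `answer_column` covers cases like "output" vs "cot_answer".
    """

    def __init__(self, source: str, load_fn: Callable, answer_column: str = "output"):
        self._data = load_fn(source)
        self._answer_column = answer_column

    def __len__(self):
        return len(self._data)

    def __getitem__(self, idx):
        sample = self._data[idx]
        # A real _prepare_sample would format and tokenize here; this sketch
        # just surfaces the prompt and the configured answer column.
        return sample["prompt"], sample[self._answer_column]
```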

Comment on lines +11 to +14
activeloop_dataset = "hub://manufe/raft_format_dataset_biomedical" # Replace with your ActiveLoop dataset


def raft_dataset(
Contributor

What is this dataset exactly? Seems like raft_dataset is still the generic dataset type. But for dataset builders like these we typically point to a canonical dataset on the Hub (e.g. Alpaca). Is raft_format_dataset_biomedical the canonical dataset here? If so, is this something that's widely-used, or intended more as a demo?

Author

RAFT being a fairly new technique, there is no real reference dataset yet; this is a dataset we are using on a real project, and it can be used as a demo.

Author

Here too, perhaps we can replace the hard-coded dataset with an input taken from the YAML file or passed as a parameter on the CLI.
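That could look something like the following config fragment (field names are illustrative, not a final schema; torchtune configs do support key=value overrides on the command line, e.g. `tune run ... dataset.dataset_path=hub://...`):

```yaml
dataset:
  _component_: torchtune.datasets.raft_dataset
  # Default points at the demo dataset; override from the CLI for other data.
  dataset_path: hub://manufe/raft_format_dataset_biomedical
```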

@efenocchi
Author

Hi @ebsmothers,
thank you very much for the suggestions and for the quick response; sorry it took me so long to reply. If there is anything I can do to get this PR accepted, I will be happy to do it.
Have a good weekend.
Emanuele

Contributor

@ebsmothers ebsmothers left a comment

Hi @efenocchi thanks for making the updates and apologies for the delay on getting back to you. Given that

(a) we try to keep our core dependencies pretty minimal in torchtune, and
(b) integrating a dev dependency tightly into our core code (in this case torchtune/datasets) is something we try to avoid,

I think we may want to take a slightly different approach on this PR. We have another ongoing integration with ClearML Logger and I wonder if we can follow a similar path here. Namely, you can push the change to a fork, and we can highlight it as a community integration in our README. Then as we get more usage and requests to integrate we can look at merging into main. Basically the process outlined here.

Let me know if this type of integration makes sense to you. Thanks!
