
Finetuning on a custom dataset #366

Open
pankajtalk opened this issue Feb 5, 2024 · 6 comments

@pankajtalk

System Info

Installed package versions:
2024-01-10 08:35:17 - Successfully installed bitsandbytes-0.39.1 black-23.12.1 brotli-1.1.0 inflate64-1.0.0 llama-recipes-0.0.1 multivolumefile-0.2.3 pathspec-0.12.1 peft-0.6.0.dev0 py7zr-0.20.6 pybcj-1.0.2 pycryptodomex-3.19.1 pyppmd-1.0.0 pyzstd-0.15.9 texttable-1.7.0 tokenize-rt-5.2.0 tomli-2.0.1 torch-2.1.0+cu118 triton-2.1.0

Finetuning command being executed:
torchrun --nnode=4 --nproc_per_node=1 --rdzv_backend=c10d --rdzv_endpoint=10.0.1.14:29400 --rdzv_conf=read_timeout=600 examples/finetuning.py --dataset "custom_dataset" --custom_dataset.file "/mnt/scripts/custom_dataset.py" --enable_fsdp --use_peft --peft_method lora --pure_bf16 --mixed_precision --batch_size_training 1 --model_name $MODEL_NAME --output_dir /home/datascience/outputs --num_epochs 1 --save_model

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

I am using the command below to finetune the "Llama-2-7b-hf" model on a custom dataset. I have passed the --dataset and --custom_dataset.file params to finetuning.py.

torchrun examples/finetuning.py \
  --enable_fsdp \
  --dataset custom_dataset \
  --custom_dataset.file /mnt/scripts/custom_dataset.py \
  --use_peft \
  --peft_method lora \
  --pure_bf16 \
  --mixed_precision \
  --batch_size_training 1 \
  --model_name $MODEL_NAME \
  --output_dir /home/datascience/outputs \
  --num_epochs 1 \
  --save_model
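
For reference, llama-recipes expects the file passed via --custom_dataset.file to expose a get_custom_dataset(dataset_config, tokenizer, split) hook. A minimal sketch of that shape (an illustrative example modeled loosely on examples/custom_dataset.py, not the actual file used here):

import datasets

def get_custom_dataset(dataset_config, tokenizer, split: str):
    # Placeholder data; a real loader would use `split` to pick the
    # train/validation partition and load an actual corpus, e.g. via
    # datasets.load_dataset(...).
    raw = datasets.Dataset.from_dict(
        {"text": ["Example prompt and response.", "Another training sample."]}
    )

    def tokenize(sample):
        ids = tokenizer(sample["text"]).input_ids + [tokenizer.eos_token_id]
        return {"input_ids": ids, "attention_mask": [1] * len(ids), "labels": ids}

    return raw.map(tokenize, remove_columns=list(raw.features))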

However, I am running into the error below. Am I missing something?

2024-01-10 08:38:31 - Traceback (most recent call last):
2024-01-10 08:38:31 -   File "/home/datascience/decompressed_artifact/code/examples/finetuning.py", line 8, in <module>
2024-01-10 08:38:31 -     fire.Fire(main)
2024-01-10 08:38:31 -   File "/home/datascience/conda/pytorch20_p39_gpu_v2/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
2024-01-10 08:38:31 -     component_trace = _Fire(component, args, parsed_flag_args, context, name)
2024-01-10 08:38:32 -   File "/home/datascience/conda/pytorch20_p39_gpu_v2/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
2024-01-10 08:38:32 -     component, remaining_args = _CallAndUpdateTrace(
2024-01-10 08:38:32 -   File "/home/datascience/conda/pytorch20_p39_gpu_v2/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
2024-01-10 08:38:32 -     component = fn(*varargs, **kwargs)
2024-01-10 08:38:32 -   File "/home/datascience/conda/pytorch20_p39_gpu_v2/lib/python3.9/site-packages/llama_recipes/finetuning.py", line 160, in main
2024-01-10 08:38:32 -     dataset_config = generate_dataset_config(train_config, kwargs)
2024-01-10 08:38:32 -   File "/home/datascience/conda/pytorch20_p39_gpu_v2/lib/python3.9/site-packages/llama_recipes/utils/config_utils.py", line 56, in generate_dataset_config
2024-01-10 08:38:32 -     assert train_config.dataset in names, f"Unknown dataset: {train_config.dataset}"
2024-01-10 08:38:32 - AssertionError: Unknown dataset: custom_dataset

Error logs

(Same traceback as in the bug description above.)

Expected behavior

Finetuning should work with a custom dataset.

@HamidShojanazeri
Contributor

@pankajtalk this works on my end; just want to make sure you have already installed llama-recipes, right?

torchrun --nnode=1 --nproc_per_node=8 examples/finetuning.py --dataset "custom_dataset" --custom_dataset.file "examples/custom_dataset.py" --enable_fsdp --use_peft --peft_method lora --pure_bf16 --mixed_precision --batch_size_training 1 --model_name $MODEL_PATH --output_dir /home/datascience/outputs --num_epochs 1 --save_model

@pankajtalk
Author

pankajtalk commented Feb 8, 2024

@HamidShojanazeri The script invocation works fine for me if I do not specify the --dataset and --custom_dataset.file params; the samsum dataset is used in that case. Once I specify --dataset and --custom_dataset.file, I get the error I mentioned, i.e. "Unknown dataset: custom_dataset" from config_utils.py.

Is there any param I can add to triage it further?

@pankajtalk
Author

pankajtalk commented Feb 8, 2024

I think llama-recipes v0.0.1 (which seems to be the latest published version) does not contain custom dataset support. I checked the installed dataset_utils.py and see

DATASET_PREPROC = {
    "alpaca_dataset": partial(get_alpaca_dataset, max_words=224),
    "grammar_dataset": get_grammar_dataset,
    "samsum_dataset": get_samsum_dataset,
}

whereas the one at https://github.com/facebookresearch/llama-recipes/blob/main/src/llama_recipes/utils/dataset_utils.py has

DATASET_PREPROC = {
    "alpaca_dataset": partial(get_alpaca_dataset),
    "grammar_dataset": get_grammar_dataset,
    "samsum_dataset": get_samsum_dataset,
    "custom_dataset": get_custom_dataset,
}
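
A minimal reproduction of the name check from config_utils.py (line 56 in the traceback above), assuming the assertion is validated against the keys of this registry:

# Mirrors the v0.0.1 registry, which has no "custom_dataset" key.
DATASET_PREPROC = {
    "alpaca_dataset": None,
    "grammar_dataset": None,
    "samsum_dataset": None,
}

requested = "custom_dataset"
names = tuple(DATASET_PREPROC.keys())
# Running this raises: AssertionError: Unknown dataset: custom_dataset
assert requested in names, f"Unknown dataset: {requested}"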

@pankajtalk
Author

As per https://pypi.org/project/llama-recipes/#history, the only release of llama-recipes was on Sep 7, 2023. Are there any plans to release a newer version with the latest code?

@HamidShojanazeri
Contributor

@pankajtalk we are working on finalizing the release; in the meantime, can you please install from source with pip install -e .?
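
Once installed from source, a quick sanity check (a sketch; DATASET_PREPROC is the registry discussed above) would be:

# Run after `pip install -e .` from a source checkout of llama-recipes.
from llama_recipes.utils.dataset_utils import DATASET_PREPROC

print(sorted(DATASET_PREPROC.keys()))
# On current main this should include "custom_dataset"; the v0.0.1 PyPI release does not.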

@HamidShojanazeri
Contributor

> @HamidShojanazeri The script invocation works fine for me if I do not specify the --dataset and --custom_dataset.file params; the samsum dataset is used in that case. Once I specify --dataset and --custom_dataset.file, I get the error I mentioned, i.e. "Unknown dataset: custom_dataset" from config_utils.py.
>
> Is there any param I can add to triage it further?

I believe it should run the OpenAssistant dataset.
