Minor question about PAD token and EOS token. #127

Open
HaniItani opened this issue Feb 26, 2024 · 2 comments


HaniItani commented Feb 26, 2024

Hello,

Thank you for sharing this awesome resource!

I have a question regarding models that already have a chat template, like "mistralai/Mistral-7B-Instruct-v0.1". I'm planning on using the non-packed dataset. I applied the chat template that comes with the tokenizer as a preprocessing step, as suggested. If I decode the samples inside the SFTTrainer after tokenization, they start with two BOS tokens. This is because the tokenizer adds a BOS token of its own (add_bos_token is set to True in the tokenizer config) on top of the one already produced by the chat template. To fix this, I need to pass dataset_kwargs={"add_special_tokens": False} to the SFTTrainer.
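For reference, the fix looks roughly like this (just a sketch; the exact keyword arguments depend on the trl version, and tokenizer/train_dataset are placeholders):

from trl import SFTTrainer

trainer = SFTTrainer(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    tokenizer=tokenizer,
    train_dataset=train_dataset,  # "text" column already contains apply_chat_template output
    dataset_text_field="text",
    packing=False,
    dataset_kwargs={"add_special_tokens": False},  # stop the tokenizer from adding a second BOS
)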

Another issue I'm having is that when the pad token is the same as the EOS token, the EOS token's label is set to -100. This might cause the model to keep generating and never stop, right? I'm seeing this phenomenon with models fine-tuned on my own dataset using the provided SFT code. One workaround would be to write my own data collator that takes this into account instead of using DataCollatorForLanguageModeling. I also found a related issue on the matter here.
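Roughly, such a collator could look like the sketch below (the class name is made up, and it assumes right padding with pad_token == eos_token):

from transformers import DataCollatorForLanguageModeling

class EosPreservingCollator(DataCollatorForLanguageModeling):
    # usage: EosPreservingCollator(tokenizer, mlm=False)
    def torch_call(self, examples):
        batch = super().torch_call(examples)
        # The parent collator has already set every label equal to pad_token_id to -100.
        # With pad == eos that also masks the real EOS, so restore the label at the
        # last non-padded position of each sequence when that position is an EOS token.
        eos_id = self.tokenizer.eos_token_id
        lengths = batch["attention_mask"].sum(dim=1)
        for i, length in enumerate(lengths):
            last = int(length) - 1
            if batch["input_ids"][i, last] == eos_id:
                batch["labels"][i, last] = eos_id
        return batch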

Any comments and guidance are very much appreciated!

@LittlePea13

Setting the pad token to EOS is an issue in our training as well. What I don't get is how Zephyr was trained with such a recipe: Mistral does not have a pad token, so the same problem arises, and its chat template includes an EOS at the end of each conversation turn. So while the same thing should happen when training on top of Mistral, HuggingFaceH4/mistral-7b-sft-beta seems able to generate EOS tokens just fine.

Was this addressed in any way during training of Zephyr?


wj210 commented Apr 22, 2024

This is true. I have tried SFT using the script above, and the model does not learn to stop generating.
The SFT script uses the default DataCollatorForLanguageModeling, and if you look at https://github.com/huggingface/transformers/blob/0d84901cb7e797c90653e2c8ca2ce2a6b3498208/src/transformers/data/data_collator.py#L778C49-L778C61

you can see that it sets the label of every pad_token_id position to the ignore index (-100), regardless of whether packing is used.
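A quick way to see the effect (a minimal check, assuming pad_token has been set to eos_token as in the recipe):

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
tokenizer.pad_token = tokenizer.eos_token

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
batch = collator([tokenizer("Hello" + tokenizer.eos_token)])
print(batch["labels"])  # the final EOS position is -100, so it contributes no loss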

I think there are only two ways around this:

1. Set a separate pad token and resize the embeddings accordingly, for example:
def resize_pad_embeddings(model, tokenizer):
    # Add a dedicated [PAD] token (distinct from EOS) and resize the embeddings,
    # following the Alpaca-style resize: new rows are initialised with the mean
    # of the existing embeddings.
    pad_token = "[PAD]"
    special_tokens_dict = dict(pad_token=pad_token)
    num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
    model.resize_token_embeddings(len(tokenizer))
    if num_new_tokens > 0:
        input_embeddings_data = model.get_input_embeddings().weight.data
        output_embeddings_data = model.get_output_embeddings().weight.data

        # Mean of the pre-existing embedding rows
        input_embeddings_avg = input_embeddings_data[:-num_new_tokens].mean(dim=0, keepdim=True)
        output_embeddings_avg = output_embeddings_data[:-num_new_tokens].mean(dim=0, keepdim=True)

        # Initialise the newly added row(s) with that mean
        input_embeddings_data[-num_new_tokens:] = input_embeddings_avg
        output_embeddings_data[-num_new_tokens:] = output_embeddings_avg

2. Use DataCollatorForSeq2Seq instead; it doesn't set pad_token_id to the ignore index, but pads inputs to the longest sequence in the batch and appends the ignore index only to the padded label positions of the shorter sequences.
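A rough sketch of option 2, assuming the tokenized examples already carry a labels column (e.g. a copy of input_ids) and that the trainer accepts a custom data_collator (model/tokenizer/train_dataset are placeholders):

from transformers import DataCollatorForSeq2Seq
from trl import SFTTrainer

collator = DataCollatorForSeq2Seq(
    tokenizer,
    padding="longest",
    label_pad_token_id=-100,  # ignore index is used only for the padded label positions
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,  # pre-tokenized, with "input_ids" and "labels"
    data_collator=collator,       # replaces the default DataCollatorForLanguageModeling
    packing=False,
)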
