BOS and EOS tokens #7057

Open
walidbet18 opened this issue May 3, 2024 · 3 comments

Comments

walidbet18 commented May 3, 2024

Hi everyone!
I have a question; it might be dumb, but I want to understand.

llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'

I know and understand what these tokens mean; to be honest, I learned them through translation tasks. But for tasks like question answering I don't understand how they work, because sometimes the answer is much longer than the question. So how do they work, can I modify them in llama.cpp, and with what criteria?

walidbet18 changed the title from llm_load_print_meta: BOS token = 1 '<s>' llm_load_print_meta: EOS token = 2 '</s>' llm_load_print_meta: UNK token = 0 '<unk>' llm_load_print_meta: PAD token = 0 '<unk>' to BOS and EOS tokens May 3, 2024
Jeximo (Contributor) commented May 3, 2024

> I don't understand how they work, because sometimes the answer is much longer

Hi. BOS means beginning of sequence, and EOS means end of sequence. They're usually special tokens defined in the model that llama.cpp uses during text generation.

llama.cpp inserts a BOS token automatically for the most part. As for EOS tokens, it depends on the model. Here's an example:

./main ~/model.gguf -cml -p "What's 5+5?"

-cml automatically fills in both the BOS and EOS tokens for the prompt (BOS token before What's, EOS token after 5?), assuming it's a ChatML model.
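For context, here's a minimal Python sketch of the ChatML layout that -cml targets. The <|im_start|> and <|im_end|> delimiters are the standard ChatML markers; the exact template llama.cpp applies may differ, and wrap_chatml is a name of my own choosing:

```python
# Rough sketch of a ChatML-style prompt wrapper, analogous to what the
# -cml flag does in llama.cpp's main example.

def wrap_chatml(user_message: str) -> str:
    """Surround a user message with ChatML turn delimiters and open the
    assistant turn so the model knows a reply should come next."""
    return (
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

print(wrap_chatml("What's 5+5?"))
```

The open assistant turn at the end is what cues the model to generate a reply, and the model is expected to emit its end-of-turn token once the answer is complete.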

walidbet18 (Author) commented

@Jeximo thanks for your answer. I understand that, but what I'm trying to do here is fine-tune my model on a text file similar to this: "function1(int, string, bool) -> none this method takes bool, int and string as parameters, function2() takes no arguments ..... etc". I'm just wondering how the model would know where to stop if I ask it to return the function1 method. How would it know that it should return only "function1(int, string, bool) -> none this method takes bool, int and string as parameters" and not all the text?

This is why I ended up wondering whether BOS and EOS would give me an idea.

arnfaldur commented
Your question seems to be about dataset creation. As I understand it, a dataset consists of many text snippets of various sizes, like the ones you describe. During training, each snippet is surrounded by BOS and EOS tokens, and the snippets are then concatenated and fed through the model. Because every snippet ends with EOS, the model learns to emit EOS once a snippet is complete, which is how it knows where to stop at inference time.
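To make that concrete, here's a minimal Python sketch. The token IDs 1 and 2 match the log above, while toy_tokenize and pack_snippets are hypothetical names; a real pipeline would use the model's actual tokenizer:

```python
# Minimal sketch of packing training snippets with BOS/EOS tokens.
# Token IDs 1 ('<s>') and 2 ('</s>') match the log in the question above;
# toy_tokenize is a hypothetical stand-in for a real tokenizer.

BOS_ID = 1  # '<s>'
EOS_ID = 2  # '</s>'

def toy_tokenize(text: str) -> list[int]:
    """Stand-in tokenizer: a real one maps text to vocabulary IDs."""
    return [ord(c) for c in text]

def pack_snippets(snippets: list[str]) -> list[int]:
    """Wrap each snippet in BOS/EOS and concatenate into one token stream."""
    stream: list[int] = []
    for snippet in snippets:
        stream.append(BOS_ID)
        stream.extend(toy_tokenize(snippet))
        stream.append(EOS_ID)
    return stream

snippets = [
    "function1(int, string, bool) -> none takes bool, int and string",
    "function2() takes no arguments",
]
print(pack_snippets(snippets))
```

At inference time, generation is cut off as soon as the model samples token 2, so an answer trained this way ends where its snippet ended.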

I suggest you close this issue, as it isn't really about llama.cpp itself. You can definitely find good resources on dataset creation and LLM training techniques elsewhere on the internet.
