BOS and EOS tokens #7057

Open
walidbet18 opened this issue May 3, 2024 · 3 comments

Comments

walidbet18 commented May 3, 2024

Hi everyone!
I have a question; it might be dumb, but I want to understand.

llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'

I know and understand what these tokens mean; to be honest, I learned them through translation tasks. But for tasks like question answering I don't understand how they work, because sometimes the answer is much longer than the question. So how do they work, can I modify them in llama.cpp, and with what criteria?

walidbet18 changed the title from llm_load_print_meta: BOS token = 1 '<s>' llm_load_print_meta: EOS token = 2 '</s>' llm_load_print_meta: UNK token = 0 '<unk>' llm_load_print_meta: PAD token = 0 '<unk>' to BOS and EOS tokens May 3, 2024
Jeximo (Contributor) commented May 3, 2024

> I don't understand how they work, because sometimes the answer is much longer

Hi. BOS means beginning of sequence, and EOS means end of sequence. They're usually special tokens defined in the model that llama.cpp uses during text generation.

llama.cpp inserts a BOS token automatically for the most part. As for EOS tokens, it depends on the model. Here's an example:

./main ~/model.gguf -cml -p "What's 5+5?"

-cml automatically fills in both the BOS and EOS tokens for the prompt (BOS token before What's, EOS token after 5?), assuming it's a ChatML model.
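For context, here's a minimal Python sketch of the ChatML layout that -cml targets. The <|im_start|> and <|im_end|> delimiters are the standard ChatML markers; the exact template llama.cpp applies may differ, and wrap_chatml is a name of my own choosing:

```python
# Rough sketch of a ChatML-style prompt wrapper, analogous to what the
# -cml flag does in llama.cpp's main example.

def wrap_chatml(user_message: str) -> str:
    """Surround a user message with ChatML turn delimiters and open the
    assistant turn so the model knows a reply should come next."""
    return (
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

print(wrap_chatml("What's 5+5?"))
```

The open assistant turn at the end is what cues the model to generate a reply, and the model is expected to emit its end-of-turn token once the answer is complete.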

walidbet18 (Author) commented

@Jeximo thanks for your answer. I understand that, but what I'm trying to do here is fine-tune my model on a text file similar to this: "function1(int, string, bool) -> none this method takes bool, int and string as parameters, function2() takes no arguments ..... etc". I'm just wondering how the model would know where to stop if I ask it to return the function1 method. How would it know that it should return only "function1(int, string, bool) -> none this method takes bool, int and string as parameters" and not all the text?

This is why I ended up wondering whether BOS and EOS would give me an idea.

arnfaldur commented
Your question seems to be about dataset creation. As I understand it, a dataset consists of many text snippets of various sizes, like the ones you describe. During training, each snippet is surrounded by BOS and EOS tokens, and the snippets are then concatenated and fed through the model. Because every snippet ends with EOS, the model learns to emit EOS once a snippet is complete, which is how it knows where to stop at inference time.
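To make that concrete, here's a minimal Python sketch. The token IDs 1 and 2 match the log above, while toy_tokenize and pack_snippets are hypothetical names; a real pipeline would use the model's actual tokenizer:

```python
# Minimal sketch of packing training snippets with BOS/EOS tokens.
# Token IDs 1 ('<s>') and 2 ('</s>') match the log in the question above;
# toy_tokenize is a hypothetical stand-in for a real tokenizer.

BOS_ID = 1  # '<s>'
EOS_ID = 2  # '</s>'

def toy_tokenize(text: str) -> list[int]:
    """Stand-in tokenizer: a real one maps text to vocabulary IDs."""
    return [ord(c) for c in text]

def pack_snippets(snippets: list[str]) -> list[int]:
    """Wrap each snippet in BOS/EOS and concatenate into one token stream."""
    stream: list[int] = []
    for snippet in snippets:
        stream.append(BOS_ID)
        stream.extend(toy_tokenize(snippet))
        stream.append(EOS_ID)
    return stream

snippets = [
    "function1(int, string, bool) -> none takes bool, int and string",
    "function2() takes no arguments",
]
print(pack_snippets(snippets))
```

At inference time, generation is cut off as soon as the model samples token 2, so an answer trained this way ends where its snippet ended.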

I suggest you close this issue, as it isn't really about llama.cpp itself. You can definitely find good resources on dataset creation and LLM training techniques elsewhere on the internet.
