
parallel inferencing producing the unknown token (token id 0) on quantized Mistral and Mixtral with added tokens with CUDA backend #7047

Open
Ralakus opened this issue May 2, 2024 · 3 comments

Comments

@Ralakus

Ralakus commented May 2, 2024

Summary

When using the CUDA backend, parallel inferencing generates the unknown token (token id 0) in all sessions except one. This issue doesn't occur when only one client is inferencing, or when llama.cpp runs with no layers offloaded to the GPU.

The unknown tokens decode to the Lower Five Eighths Block character (▅, U+2585) and seem to turn the rest of the output into a jumbled mess of text that only partially resembles a response.
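For reference, the garbage character in the corrupted responses really is a single code point, not mojibake from a multi-byte split; a quick sanity check (plain Unicode facts, not tied to llama.cpp's tokenizer):

```python
# The stray character in the output is U+2585 LOWER FIVE EIGHTHS BLOCK,
# a single code point encoded as three bytes in UTF-8.
ch = "\u2585"
print(ch)                  # ▅
print(ch.encode("utf-8"))  # b'\xe2\x96\x85'
print(f"U+{ord(ch):04X}")  # U+2585
```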

Steps to Reproduce and Output

  1. Download the 4-bit quantized finetuned model.
  2. Build llama.cpp with the CUDA backend.
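The build step can be sketched as follows (a build-configuration sketch assuming a CMake build around this era of llama.cpp; the flag has been renamed over the project's history, so verify against your checkout):

```shell
# Configure with the CUDA backend enabled.
# Flag name as of early/mid 2024; older builds used LLAMA_CUBLAS,
# newer ones use GGML_CUDA.
cmake -B build -DLLAMA_CUDA=ON
cmake --build build --config Release -j
```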

Parallel example program using CUDA

llama.cpp/build/bin/parallel -m Mixtral-Test-Finetune/ggml-model-q4_k_m.gguf -ngl 999 -c 32768 --parallel 4 --sequences 4 --seed 25519

Full output

Log

Client   0, seq   0/  4, prompt   11 t, response  102 t, time  4.64 s, speed 24.35 t/s, cache miss 0
Input:    What is the meaning of life?
Response: The meaning of life is a philosophical question that has been debated throughout history. Some people believe that the meaning of life is to seek happiness, fulfillment, and personal growth. Others believe that the meaning of life is to serve a higher purpose, such as a god or a moral principle. Still others believe that the meaning of life is to create meaning for oneself through one's experiences and relationships. Ultimately, the meaning of life is a subjective concept that varies from person to person.</s>

Client   2, seq   2/  4, prompt   11 t, response  114 t, time  5.10 s, speed 24.53 t/s, cache miss 0
Input:    What is the meaning of life?
Response: The meaning of life is a philosophical question that has▅p been debated▅sp for centuries.▅▅▅▅▅sp There is no one answer that is true for everyone. Some people believe that the meaning of life is to be happy, while others believe that it is to be successful. Some people believe that the meaning of life is to make a difference in the world, while others believe that it is to find inner peace. Ultimately, the meaning of▅▅▅sp life is a▅sp personal matter that each individual must decide for themselves.

Client   3, seq   3/  4, prompt   22 t, response  128 t, time  5.52 s, speed 27.17 t/s, cache miss 0
Input:    Are you familiar with the Special Theory of Relativity and can you explain it to me?
Response: Yes, I can▅ explain the Special Theory of▅Relativity▅Relativity to▅▅Relativity you▅▅▅▅▅Relativity. The Special Theory▅Relativity of Relativity is a▅▅Rel▅▅▅▅▅▅Rel▅Relativity▅▅▅▅▅▅Relativity▅▅▅▅Rel▅Rel▅▅▅▅▅▅▅▅Rel▅Relativity▅▅▅▅▅▅▅Relativity▅              ▅Rel▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅Rel▅Relativity was proposed by Albert Einstein in 190▅</s>

Client   1, seq   1/  4, prompt   15 t, response  250 t, time  7.67 s, speed 34.53 t/s, cache miss 0
Input:    What is the best way to cook a steak?
Response: The best way to▅: 1. Pre▅: heat the oven▅: to 3▅▅▅▅▅▅▅▅▅:▅: ▅▅▅: 7▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅:▅▅▅▅: 5▅▅▅▅▅▅▅▅▅▅▅:▅▅: ▅▅▅▅:▅:▅▅▅▅▅▅▅▅▅▅▅▅▅: 200°F. 2. Season the steak with salt and pepper. 3. Heat a cast iron skillet over high heat. 4. Add a small amount of oil to the skillet. 5. When the oil is hot, add the steak to the skillet. 6. Sear the steak for 2-3 minutes on each side. 7. Transfer the skillet to the preheated oven. 8. Cook the steak in the oven for 5-7 minutes for medium-rare. 9. Remove the steak from the oven and let it rest for 5-10 minutes before slicing it. 10. Enjoy!

Server parallel client test script using CUDA

Python script

Output

Parallel example program using CPU

llama.cpp/build/bin/parallel -m Mixtral-Test-Finetune/ggml-model-q4_k_m.gguf -ngl 0 -c 32768 --parallel 4 --sequences 4 --seed 25519

Full output

Log

Client   1, seq   1/  4, prompt   15 t, response   14 t, time  8.32 s, speed  3.49 t/s, cache miss 0
Input:    What is the best way to cook a steak?
Response: The best way to cook a steak is to follow these steps:

Client   0, seq   0/  4, prompt   11 t, response   79 t, time 34.95 s, speed  2.58 t/s, cache miss 0
Input:    What is the meaning of life?
Response: The meaning of life is a philosophical question that has been debated throughout history. Some people believe that the meaning of life is to seek happiness, fulfillment, and personal growth. Others believe that the meaning of life is to serve a higher purpose, such as a god or a moral principle. Ultimately, the meaning of life is a personal belief that each individual must determine for themselves.</s>

Client   2, seq   2/  4, prompt   11 t, response  105 t, time 44.61 s, speed  2.60 t/s, cache miss 0
Input:    What is the meaning of life?
Response: The meaning of life is a philosophical question that has been debated throughout history. Some people believe that the meaning of life is to seek happiness, fulfillment, and personal growth. Others believe that the meaning of life is to serve a higher purpose, such as a god or a moral principle. Still others believe that life has no inherent meaning, and that it is up to each individual to create their own purpose. Ultimately, the meaning of life is a subjective question that each person must answer for themselves.

Client   3, seq   3/  4, prompt   22 t, response  105 t, time 44.61 s, speed  2.85 t/s, cache miss 0
Input:    Are you familiar with the Special Theory of Relativity and can you explain it to me?
Response: Yes, I can explain the Special Theory of Relativity. The Special Theory of Relativity was proposed by Albert Einstein in 1905. It is a theory of physics that describes the behavior of objects that are moving at constant speeds in a straight line. The theory is based on two postulates: (1) the laws of physics are the same in all inertial frames of reference, and (2) the speed of light in a vacuum is the same for all observers, regardless of their motion.

Server parallel client test script using CPU

Python script

Output

Work-around

Setting the logit_bias of the unknown token so it is never generated (or to some very low value like -999999999) works around llama.cpp generating unknown tokens when inferencing in parallel. While biasing token 0 bypasses the issue, it would still be worth tracking down the root cause of the discrepancy between CPU and CUDA inferencing.
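For the server path, the same work-around can be expressed in the request body. A sketch against llama.cpp's `/completion` endpoint, which accepts a `logit_bias` field of `[token_id, bias]` pairs (endpoint and field names per the server README at the time; verify against your build — `build_request` is a hypothetical helper, not part of llama.cpp):

```python
import json

def build_request(prompt, n_predict=128):
    """Build a /completion request body that suppresses token id 0."""
    return {
        "prompt": prompt,
        "n_predict": n_predict,
        # [token_id, bias] pairs; a large negative bias makes the
        # unknown token (id 0) effectively impossible to sample.
        "logit_bias": [[0, -999999999]],
    }

if __name__ == "__main__":
    body = build_request("What is the meaning of life?")
    print(json.dumps(body, indent=2))
    # POST this to http://localhost:8080/completion on a running
    # llama.cpp server, e.g. requests.post(url, json=body).
```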

Parallel example program

llama.cpp/build/bin/parallel -m Mixtral-Test-Finetune/ggml-model-q4_k_m.gguf -ngl 999 -c 32768 --parallel 4 --sequences 4 --seed 25519 --logit-bias 0-999999999

Full output

Log

Client   1, seq   1/  4, prompt   15 t, response   14 t, time  0.55 s, speed 52.82 t/s, cache miss 0
Input:    What is the best way to cook a steak?
Response: The best way to cook a steak is to follow these steps:

Client   1, seq    2, started decoding ...
Client   0, seq   0/  4, prompt   11 t, response  102 t, time  3.23 s, speed 35.01 t/s, cache miss 0
Input:    What is the meaning of life?
Response: The meaning of life is a philosophical question that has been debated throughout history. Some people believe that the meaning of life is to seek happiness, fulfillment, and personal growth. Others believe that the meaning of life is to serve a higher purpose, such as a god or a moral principle. Still others believe that the meaning of life is to create meaning for oneself through one's experiences and relationships. Ultimately, the meaning of life is a subjective concept that varies from person to person.</s>

Client   0, seq    3, started decoding ...
Client   1, seq   2/  4, prompt   11 t, response  101 t, time  3.18 s, speed 35.21 t/s, cache miss 0
Input:    What is the meaning of life?
Response: The meaning of life is a philosophical question that has been debated by scholars and thinkers for centuries. There is no one answer to this question, as it depends on each individual's personal beliefs and values. Some people believe that the meaning of life is to find happiness and fulfillment, while others believe that it is to make a positive impact on the world. Ultimately, the meaning of life is a deeply personal and subjective concept that can only be determined by each individual for themselves.

Client   0, seq   3/  4, prompt   22 t, response  100 t, time  2.02 s, speed 60.28 t/s, cache miss 0
Input:    Are you familiar with the Special Theory of Relativity and can you explain it to me?
Response: Yes, I am familiar with the Special Theory of Relativity. The Special Theory of Relativity is a theory of physics that was developed by Albert Einstein in 1905. The theory is based on two postulates: first, that the laws of physics are the same in all inertial frames of reference; and second, that the speed of light in a vacuum is the same in all inertial frames of reference, regardless of the motion of the light source or the observer.

Server parallel client test script

Python script

Full output

System Specifications

  • llama.cpp commit f364eb6
  • Oracle Linux 8 with Kernel 5.15.0-203.146.5.1.el8uek.x86_64
  • Threadripper 7960X
  • Nvidia RTX A6000 48GB
    • Driver Version: 550.54.14
    • CUDA Version: 12.4
@slaren
Collaborator

slaren commented May 2, 2024

Problems with finetunes with CUDA are often caused by the model producing values that cannot be represented in a 16-bit float. If that's the case, building with LLAMA_CUDA_FORCE_MMQ may help.
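The fp16 range limit behind this suggestion can be illustrated without CUDA: IEEE 754 half precision tops out at 65504, so any activation beyond that overflows when stored as f16. A minimal sketch using Python's `struct` half-float format:

```python
import struct

def fits_in_f16(x: float) -> bool:
    """Return True if x can be stored as a finite IEEE 754 half."""
    try:
        struct.pack("<e", x)  # 'e' is the half-precision format code
        return True
    except OverflowError:
        return False

print(fits_in_f16(65504.0))  # True  -- largest finite half value
print(fits_in_f16(1e5))      # False -- would overflow to infinity
```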

@Ralakus
Author

Ralakus commented May 2, 2024

Compiling with LLAMA_CUDA_FORCE_MMQ didn't seem to fix the issue. What I'm most curious about is why it only happens when running parallel inferencing and goes back to normal with a logit bias.

Parallel output with force mmq
Parallel log with force mmq

@Ralakus
Author

Ralakus commented May 8, 2024

I did a lot more tests and found that the finetuning actually wasn't the issue. Adding any extra tokens to the tokenizer before converting the model AND then quantizing the model with the added tokens causes this with Mistral and Mixtral models. The issue doesn't occur when running the model in f16 or f32 precision.

I did the tests on the Mistral Instruct v0.2 model with an added padding token.
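The actual "Create model script" is linked below rather than inlined, but the token-adding step presumably follows the standard Hugging Face pattern sketched here (the `<pad>` token string and the duck-typed arguments are illustrative assumptions, not taken from the issue):

```python
def add_pad_token(tokenizer, model, pad_token="<pad>"):
    """Add a padding token and grow the embedding table to match.

    Intended for Hugging Face tokenizer/model objects; written with
    duck-typed arguments so the logic is testable without downloads.
    Returns the number of tokens actually added.
    """
    added = tokenizer.add_special_tokens({"pad_token": pad_token})
    if added:  # only resize if the vocabulary actually grew
        model.resize_token_embeddings(len(tokenizer))
    return added
```

After this step the model would be saved and run through llama.cpp's convert and quantize tools, which is where the issue reportedly appears.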

Create model script

Q4_K_M Parallel out
Q4_K_M Parallel log

Q4_K_M Parallel out with logit bias
Q4_K_M Parallel log with logit bias

F16 Parallel out
F16 Parallel log

F32 Parallel out
F32 Parallel log

@Ralakus Ralakus changed the title from "parallel inferencing producing the unknown token (token id 0) on finetuned mixtral with CUDA backend" to "parallel inferencing producing the unknown token (token id 0) on quantized Mistral and Mixtral with added tokens with CUDA backend" on May 8, 2024