Reminder

I have read the README and searched the existing issues.
Reproduction
Hi there, I am observing a difference in output between LLaMA-Factory inference and llama.cpp.
I am trying to convert a fine-tuned microsoft/Phi-3-mini-128k-instruct model that was trained using LoRA. Briefly, these are the steps I followed:
Fine-tune the pre-trained model to obtain the fine-tuned weight files.
Use the merge script in examples/merge_lora with the command:

```shell
CUDA_VISIBLE_DEVICES=0 python ../../src/export_model.py \
    --model_name_or_path microsoft/Phi-3-mini-128k-instruct \
    --adapter_name_or_path /mnt/9b6a1070-4fbf-4c22-a68b-043ec1f59e46/anidh_ckpt/checkpoints/Phi-3-mini-128k-instruct-Prompt-Dataset-Equal-Sampled-v1-Sharegpt-700Epochs-Fashion/lora/sft/checkpoint-600/ \
    --template default \
    --finetuning_type lora \
    --export_dir /mnt/9b6a1070-4fbf-4c22-a68b-043ec1f59e46/anidh_ckpt/checkpoints/Phi-3-mini-128k-Fashion-Full-New/ \
    --export_size 8 \
    --export_device cuda \
    --export_legacy_format False
```
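As a side note on what the merge step does conceptually: LoRA stores a low-rank update per layer, and merging folds it into the base weight as W' = W + (alpha/r) * B @ A. The toy sketch below (made-up shapes and values, not the real export script) illustrates the arithmetic:

```python
import numpy as np

# Toy illustration of LoRA merging: the adapter holds low-rank factors A and B,
# and merging folds the update into the base weight as W' = W + (alpha/r) * B @ A.
# Shapes and values here are invented; the real merge runs over every adapted layer.
d, r, alpha = 4, 2, 16
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))
A = rng.standard_normal((r, d))
B = np.zeros((d, r))  # B is zero-initialized in LoRA, so the initial delta is zero

W_merged = W + (alpha / r) * (B @ A)
print(np.allclose(W_merged, W))  # True: with B still zero, merging changes nothing
```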
This produces merged weights (base model + LoRA adapters). I then run inference on the merged safetensors weights with:

```shell
CUDA_VISIBLE_DEVICES=0 python ../../src/cli_demo.py \
    --model_name_or_path /mnt/9b6a1070-4fbf-4c22-a68b-043ec1f59e46/anidh_ckpt/checkpoints/Phi-3-mini-128k-Fashion-Full-New/ \
    --template default
```
For a given prompt, the output of the above command is:
Now, I want to run the above model in the Ollama framework. To do that, I need to convert the merged model to GGUF format.
I follow these steps:
I use the convert-hf-to-gguf.py script in llama.cpp to convert the merged weights to a GGUF file:

```shell
python convert-hf-to-gguf.py /mnt/9b6a1070-4fbf-4c22-a68b-043ec1f59e46/anidh_ckpt/checkpoints/Phi-3-mini-128k-Fashion-Full-New \
    --outfile /mnt/9b6a1070-4fbf-4c22-a68b-043ec1f59e46/anidh_ckpt/checkpoints/Phi-3-mini-128k-Fashion-Full-New/Phi-3-mini-128k-Fashion-Full-New.gguf
```
Then, using the generated GGUF file, I run inference with llama.cpp:

```shell
./main -m /mnt/9b6a1070-4fbf-4c22-a68b-043ec1f59e46/anidh_ckpt/checkpoints/Phi-3-mini-128k-Fashion-Full-New/Phi-3-mini-128k-Fashion-Full-New.gguf \
    -ins --interactive-first --temp 0.01 -c 6000 --top-k 50 --top-p 0.7 -n 1024
```
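For an apples-to-apples comparison, the llama.cpp sampling flags used above have to match the values the other frameworks use. A small sketch of how those flags map onto the standard Hugging Face GenerationConfig field names (the mapping itself is an assumption worth double-checking against each framework's defaults):

```python
# Translate the llama.cpp flags from the ./main command above into the
# equivalently named Hugging Face GenerationConfig fields.
llama_cpp_flags = {"--temp": 0.01, "--top-k": 50, "--top-p": 0.7, "-n": 1024, "-c": 6000}

hf_generation_kwargs = {
    "temperature": llama_cpp_flags["--temp"],
    "top_k": llama_cpp_flags["--top-k"],
    "top_p": llama_cpp_flags["--top-p"],
    "max_new_tokens": llama_cpp_flags["-n"],
    "do_sample": llama_cpp_flags["--temp"] > 0,  # a positive temperature implies sampling
}
print(hf_generation_kwargs)
```

Any flag that differs between the two stacks (or is left at a framework default on one side) is a candidate cause for diverging outputs.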
When I pass the same prompt as in the previous case, I get the following output:
Comparing this to the previous output, the two models clearly disagree. I have tried this multiple times and always get different outputs.
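One thing worth noting here: as long as sampling is enabled (temperature > 0) and no seed is fixed, run-to-run variation is expected even at a very low temperature, whereas greedy decoding is deterministic. A toy demonstration with invented logits:

```python
import numpy as np

# Toy demonstration: sampling from the same logits with different random states
# can pick different tokens, while greedy decoding (argmax) always agrees.
# The logits are made up purely for illustration.
logits = np.array([2.0, 1.9, 0.5])

def sample(logits, temperature, rng):
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

greedy = [int(np.argmax(logits)) for _ in range(5)]
sampled = [sample(logits, 1.0, np.random.default_rng(seed)) for seed in range(5)]
print(greedy)   # [0, 0, 0, 0, 0] -- deterministic
print(sampled)  # may vary with the random state
```

So identical outputs across frameworks can only be expected with greedy decoding (or a shared seed and identical sampler implementations), which the flags above do not guarantee.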
I have also tried the mistral.rs inference framework, which uses the safetensors file directly (no GGUF conversion needed, so it is directly comparable to LLaMA-Factory). Its output is similar to llama.cpp's, which makes me believe that I am either missing something when running inference on those two frameworks, or LLaMA-Factory is using some file or parameters that the other frameworks are not.
My initial suspicion is that this is due to a difference in inference parameters. I ran python ../../src/cli_demo.py --help to see all the parameters and their values, but there are so many of them, with no indication of which ones are actually used during inference.
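One way to narrow this down, assuming the export wrote the usual transformers files: the merged export directory normally contains a generation_config.json with the sampling defaults the CLI will pick up. The sketch below simulates such a file with illustrative values; in practice, point it at the real export_dir instead:

```python
import json
import pathlib
import tempfile

# Hedged sketch: inspect the sampling defaults shipped alongside the merged
# weights. We create a stand-in generation_config.json with ILLUSTRATIVE values;
# replace export_dir with the real export directory to inspect the actual file.
export_dir = pathlib.Path(tempfile.mkdtemp())
(export_dir / "generation_config.json").write_text(
    json.dumps({"do_sample": True, "temperature": 0.95, "top_p": 0.7})  # illustrative
)

cfg = json.loads((export_dir / "generation_config.json").read_text())
defaults = {k: cfg.get(k) for k in ("do_sample", "temperature", "top_p", "top_k")}
print(defaults)
```

If these defaults differ from the flags passed to llama.cpp or mistral.rs, that alone could explain the diverging outputs.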
Can someone please help me figure out how to solve this issue?
Expected behavior
The expected behaviour is that the output from LLaMA-Factory inference should match the output from the llama.cpp and mistral.rs frameworks.
System Info
The below is from the llama_factory environment:

transformers version: 4.40.1

The below is from the llama.cpp environment:

transformers version: 4.40.1

Others
No response