Output difference between LLaMA-Factory and llama.cpp #3563

anidh opened this issue May 3, 2024 · 1 comment
Labels
pending This problem is yet to be addressed.

Comments


anidh commented May 3, 2024

Reminder

  • I have read the README and searched the existing issues.

Reproduction

Hi there, I am observing a difference in output between LLaMA-Factory inference and llama.cpp.

I am trying to convert a fine-tuned microsoft/Phi-3-mini-128k-instruct model that was trained using LoRA. Briefly, these are the steps I followed:

  1. Fine-tune the pre-trained model and obtain the fine-tuned weight files.
  2. Merge the LoRA adapters into the base model with the merge script in examples/merge_lora, using the command CUDA_VISIBLE_DEVICES=0 python ../../src/export_model.py --model_name_or_path microsoft/Phi-3-mini-128k-instruct --adapter_name_or_path /mnt/9b6a1070-4fbf-4c22-a68b-043ec1f59e46/anidh_ckpt/checkpoints/Phi-3-mini-128k-instruct-Prompt-Dataset-Equal-Sampled-v1-Sharegpt-700Epochs-Fashion/lora/sft/checkpoint-600/ --template default --finetuning_type lora --export_dir /mnt/9b6a1070-4fbf-4c22-a68b-043ec1f59e46/anidh_ckpt/checkpoints/Phi-3-mini-128k-Fashion-Full-New/ --export_size 8 --export_device cuda --export_legacy_format False
  3. This produces a merged model (base model + LoRA adapters). I then run inference on the merged safetensors weights with CUDA_VISIBLE_DEVICES=0 python ../../src/cli_demo.py --model_name_or_path /mnt/9b6a1070-4fbf-4c22-a68b-043ec1f59e46/anidh_ckpt/checkpoints/Phi-3-mini-128k-Fashion-Full-New/ --template default
  4. For a certain prompt, the output of the above command is shown below (see the generation-defaults sketch right after the output):
tier: None
gender: None
location: NY
generation: genz
category: None
product_type: None
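
One thing worth checking on the LLaMA-Factory side is which generation defaults the exported directory ships, since transformers-based inference (cli_demo.py) reads them while llama.cpp does not. A minimal sketch, assuming the export directory written by export_model.py contains the usual config.json and generation_config.json next to the safetensors shards:

  # Assumption: generation_config.json exists in the export directory and holds
  # the default sampling settings (temperature, top_p, do_sample, ...) that the
  # transformers-based demo picks up automatically.
  cat /mnt/9b6a1070-4fbf-4c22-a68b-043ec1f59e46/anidh_ckpt/checkpoints/Phi-3-mini-128k-Fashion-Full-New/generation_config.json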

Now I want to run this model in the Ollama framework, which requires converting the merged model to GGUF format.
I follow these steps:

  1. I use the convert-hf-to-gguf.py script in llama.cpp to convert the merged weights into a GGUF file. The command is python convert-hf-to-gguf.py /mnt/9b6a1070-4fbf-4c22-a68b-043ec1f59e46/anidh_ckpt/checkpoints/Phi-3-mini-128k-Fashion-Full-New --outfile /mnt/9b6a1070-4fbf-4c22-a68b-043ec1f59e46/anidh_ckpt/checkpoints/Phi-3-mini-128k-Fashion-Full-New/Phi-3-mini-128k-Fashion-Full-New.gguf
  2. Using the generated GGUF file, I run inference from llama.cpp with the command
     ./main -m /mnt/9b6a1070-4fbf-4c22-a68b-043ec1f59e46/anidh_ckpt/checkpoints/Phi-3-mini-128k-Fashion-Full-New/Phi-3-mini-128k-Fashion-Full-New.gguf -ins --interactive-first --temp 0.01 -c 6000 --top-k 50 --top-p 0.7 -n 1024
  3. When I pass the same prompt as in the case above, I get the following output (a non-interactive variant of this run is sketched after the output):
{
  "tier": "Nano",
  "gender": "None",
  "location": "NY",
  "category": "party wear",
  "generation": "genz"
}
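
To rule out differences that the interactive -ins mode introduces around the prompt, I could also run llama.cpp non-interactively with the exact prompt text supplied from a file. A minimal sketch, where prompt.txt is a hypothetical file containing the same fully formatted prompt string that LLaMA-Factory's template produces:

  # Assumption: prompt.txt holds the identical prompt (including any template
  # markers) that cli_demo.py sends to the model; sampling flags match the run above.
  ./main -m /mnt/9b6a1070-4fbf-4c22-a68b-043ec1f59e46/anidh_ckpt/checkpoints/Phi-3-mini-128k-Fashion-Full-New/Phi-3-mini-128k-Fashion-Full-New.gguf -f prompt.txt --temp 0.01 -c 6000 --top-k 50 --top-p 0.7 -n 1024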

Comparing this JSON output with the LLaMA-Factory output above, the two runs clearly disagree. I have tried this multiple times, and the two frameworks always produce different outputs.

I have also tried the mistral.rs inference framework, which uses the safetensors files directly (no GGUF conversion needed, so it is directly comparable to LLaMA-Factory). The output in that case is also similar to llama.cpp's, which makes me believe that either I am missing something when running inference on the other two frameworks, or LLaMA-Factory is using some file or parameters that the other frameworks are not.

My initial suspicion is that this is due to a difference in inference parameters. I ran python ../../src/cli_demo.py --help to see all the parameters and their values, but there are a lot of parameters and no indication of which of them are actually used during inference.
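
If cli_demo.py accepts the sampling arguments listed in its --help output, one thing I could try is pinning the sampler to the same values I pass to llama.cpp. A minimal sketch; the flag names below are my assumption based on the help output and may need adjusting:

  # Assumption: --temperature/--top_p/--top_k are accepted by cli_demo.py's
  # argument parser; values mirror the llama.cpp invocation above.
  CUDA_VISIBLE_DEVICES=0 python ../../src/cli_demo.py --model_name_or_path /mnt/9b6a1070-4fbf-4c22-a68b-043ec1f59e46/anidh_ckpt/checkpoints/Phi-3-mini-128k-Fashion-Full-New/ --template default --temperature 0.01 --top_p 0.7 --top_k 50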

Can someone please help me figure out how to resolve this issue?

Expected behavior

The expected behaviour is that the output from LLaMA-Factory inference matches the output from the llama.cpp and mistral.rs frameworks.

System Info

The below is from the LLaMA-Factory environment:

  • transformers version: 4.40.1
  • Platform: Linux-5.15.0-97-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.22.2
  • Safetensors version: 0.4.3
  • Accelerate version: 0.29.3
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.3.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: Parallel

The below is from the llama.cpp environment:

  • transformers version: 4.40.1
  • Platform: Linux-5.15.0-97-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.23.0
  • Safetensors version: 0.4.3
  • Accelerate version: not installed
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: Parallel

Others

No response

@hiyouga added the "pending This problem is yet to be addressed." label on May 3, 2024

anidh commented May 6, 2024

Hi @hiyouga, if there is a way to see only the inference-time parameters, that would also be a good starting point for me.
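
In case it helps, a quick way to locate the inference-time arguments might be to search the source for the sampling parameters directly. A minimal sketch, assuming the generation-related arguments are defined together somewhere under src/:

  # Assumption: the generation arguments (temperature, top_p, top_k, ...) are
  # declared in one place in the source tree; grep should surface that file.
  grep -rn "temperature" ../../src/ | head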
