
Reproducing LIMAEval results #3

Open
bangawayoo opened this issue May 7, 2024 · 0 comments

bangawayoo commented May 7, 2024

Hi,
Thank you for the interesting work!

I am trying to reproduce the LIMAEval results for LLaMA-2-7b with the discarding tuning method.
I generated outputs with the released model KCA_Llama_2_7B_Discarding_Tuning and then ran the evaluation script with the default setting, which calls gpt-4.

My result is slightly different from the one reported in the paper (30.95):

"hallucination_judge": {
"all_scores": 0.3220338983050847,
"error_cnt": 0
,}

I initially thought this might be caused by an API model update.
However, according to the OpenAI documentation, gpt-4 currently points to the gpt-4-0613 snapshot, so the judge model itself should not have changed.
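
For what it's worth, one way to rule out judge drift entirely would be to pin the snapshot explicitly rather than relying on the gpt-4 alias. A rough sketch with the openai Python client (illustrative only; the repo's evaluation script is presumably wired differently):

```python
# Illustrative sketch: call the judge with a fixed snapshot instead of the
# floating "gpt-4" alias, so every run scores with the same model version.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4-0613",  # pinned snapshot rather than "gpt-4"
        messages=[{"role": "user", "content": prompt}],
        temperature=0,       # reduce sampling variance in the judgments
    )
    return response.choices[0].message.content
```

Even with a pinned snapshot and temperature 0, GPT-4 judgments are not perfectly deterministic, so I would expect some run-to-run variance in the judge score; I am just not sure it explains a gap of about one point.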

Do you have any idea why this might be happening?

For the record, the ROUGE scores on MS MARCO are also slightly off compared to Table 2:

{
    "ROUGE-1": 31.27,
    "ROUGE-2": 20.0,
    "ROUGE-L": 27.57,
    "ROUGE-Lsum": 27.7
}
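
In case the scoring configuration matters, this is roughly how I compute ROUGE on my side (a sketch with the rouge_score package; the example strings and the use_stemmer choice are placeholders, and the repo's script may use different settings):

```python
# Sketch: ROUGE with Google's rouge_score package. Option choices such as
# stemming, tokenization, and how sentences are separated for ROUGE-Lsum
# (it treats newlines as sentence boundaries) can each shift scores slightly.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rougeL", "rougeLsum"],
    use_stemmer=True,  # assumption; scores change a bit if stemming is off
)

reference = "A placeholder gold answer.\nSecond sentence on its own line."
prediction = "A placeholder generated answer.\nAnother sentence."
scores = scorer.score(reference, prediction)  # argument order: (target, prediction)
print({k: round(v.fmeasure * 100, 2) for k, v in scores.items()})
```

Differences in these options (or in generation sampling) would be my first guess for the small gap, but I may be missing something.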

Thanks!
