
Reproducing LIMAEval results #3

Open
bangawayoo opened this issue May 7, 2024 · 0 comments

bangawayoo commented May 7, 2024

Hi,
Thank you for the interesting work!

I am trying to reproduce the LIMAEval results for LLaMA-2-7b with the discarding tuning method.
I generated outputs with the released model KCA_Llama_2_7B_Discarding_Tuning and then ran the evaluation script with the default setting, which calls gpt-4.

My result is slightly different from the one reported in the paper (30.95):

"hallucination_judge": {
"all_scores": 0.3220338983050847,
"error_cnt": 0
,}

I initially thought this might be caused by an API model update.
However, according to the OpenAI documentation, gpt-4 currently points to the gpt-4-0613 snapshot, so the judge model itself should not have changed.
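
For what it's worth, one way to rule out judge drift entirely would be to pin the snapshot explicitly rather than relying on the gpt-4 alias. A rough sketch with the openai Python client (illustrative only; the repo's evaluation script is presumably wired differently):

```python
# Illustrative sketch: call the judge with a fixed snapshot instead of the
# floating "gpt-4" alias, so every run scores with the same model version.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4-0613",  # pinned snapshot rather than "gpt-4"
        messages=[{"role": "user", "content": prompt}],
        temperature=0,       # reduce sampling variance in the judgments
    )
    return response.choices[0].message.content
```

Even with a pinned snapshot and temperature 0, GPT-4 judgments are not perfectly deterministic, so I would expect some run-to-run variance in the judge score; I am just not sure it explains a gap of about one point.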

Do you have any idea why this might be happening?

For the record, the ROUGE scores on MS MARCO are also slightly off compared to Table 2:

{
    "ROUGE-1": 31.27,
    "ROUGE-2": 20.0,
    "ROUGE-L": 27.57,
    "ROUGE-Lsum": 27.7
}
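
In case the scoring configuration matters, this is roughly how I compute ROUGE on my side (a sketch with the rouge_score package; the example strings and the use_stemmer choice are placeholders, and the repo's script may use different settings):

```python
# Sketch: ROUGE with Google's rouge_score package. Option choices such as
# stemming, tokenization, and how sentences are separated for ROUGE-Lsum
# (it treats newlines as sentence boundaries) can each shift scores slightly.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rougeL", "rougeLsum"],
    use_stemmer=True,  # assumption; scores change a bit if stemming is off
)

reference = "A placeholder gold answer.\nSecond sentence on its own line."
prediction = "A placeholder generated answer.\nAnother sentence."
scores = scorer.score(reference, prediction)  # argument order: (target, prediction)
print({k: round(v.fmeasure * 100, 2) for k, v in scores.items()})
```

Differences in these options (or in generation sampling) would be my first guess for the small gap, but I may be missing something.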

Thanks!
