
Add openfunctionv2 model inference script and fix minor bug #360

Open · wants to merge 5 commits into main
Conversation

@JasonZhu1313 (Contributor) commented Apr 15, 2024

Summary

This PR introduces a new model handler, openfunctions_handler.py, to run inference on the open-source model gorilla-llm/gorilla-openfunctions-v2 and reproduce its results on the leaderboard.

Issue: #352

Changes

  • Merged the input data into a single file, since that is what the evaluation script expects; the data lives under /gorilla/berkeley-function-call-leaderboard/data/BFCL/questions.json
  • Fixed a minor bug in utils.py: the return of the collected function object was mistakenly placed inside the for loop, when it should sit outside the loop (see the sketch after this list)
  • Added the new OpenfunctionsHandler to the handler map
  • Added openfunctions_handler.py, with a prompt template and decoding step compatible with the openfunctions-v2 model
  • Added a simple handler_runner.py to run the inference and save the result for evaluation
  • Added instructions to readme.md
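
A minimal sketch of the bug shape fixed in utils.py; the function and variable names here are illustrative, not the repository's actual code.

```python
def build_functions(entries):
    """Collect one converted function object per input entry."""
    functions = []
    for entry in entries:
        functions.append(parse_function(entry))  # parse_function is hypothetical
        # Bug: a `return functions` placed here exits after the first
        # iteration, so only one function is ever collected.
    # Fix: return after the loop completes, so every entry is included.
    return functions
```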

Test

  1. Generate the inference result with vLLM:

```bash
python model_handler/handler_runner.py --data-path /home/jobuser/gorilla/berkeley-function-call-leaderboard/data/gorilla_openfunctions_v1_test_all.json --model-name gorilla-llm/gorilla-openfunctions-v2 --model-path {PATH_TO_MODEL}/gorilla-openfunctions-v2/
```

Result: https://www.toptal.com/developers/paste-gd/GLTJTe4l
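
For reference, a minimal sketch of the batched vLLM call that handler_runner.py is assumed to wrap. The prompt layout follows the published gorilla-openfunctions-v2 model card (system prompt abbreviated here), and the input path and JSON field names are placeholders.

```python
import json

from vllm import LLM, SamplingParams

# Abbreviated system prompt, assumed from the gorilla-openfunctions-v2 model card.
SYSTEM = (
    "You are an AI programming assistant, utilizing the Gorilla LLM model, "
    "developed by Gorilla LLM, and you only answer questions related to computer science."
)

def build_prompt(question: str, functions: list) -> str:
    # Function-calling prompt layout assumed from the model card.
    return (
        f"{SYSTEM}\n### Instruction: <<function>>{json.dumps(functions)}\n"
        f"<<question>> {question}\n### Response: "
    )

# Placeholder input path; the real script takes --data-path on the CLI.
with open("gorilla_openfunctions_v1_test_all.json") as f:
    rows = [json.loads(line) for line in f]

prompts = [build_prompt(r["question"], r["function"]) for r in rows]

# Batched greedy decoding over all prompts in one call.
llm = LLM(model="gorilla-llm/gorilla-openfunctions-v2")
outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=512))
for out in outputs:
    print(out.outputs[0].text)
```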

  2. Generate the eval result with eval_runner (AST-based metrics):

```bash
python {PATH_TO_REPO}/gorilla/berkeley-function-call-leaderboard/eval_checker/eval_runner.py --model gorilla-llm/gorilla-openfunctions-v2 --skip-api-sanity-check --test-category simple sql relevance parallel_multiple_function parallel_function multiple_function
```

```text
🦍 Model: gorilla-llm_gorilla-openfunctions-v2
2024-04-15 18:37:42,909 INFO worker.py:1616 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
🔍 Running test: parallel_multiple_function
✅ Test completed: parallel_multiple_function. 🎯 Accuracy: 0.695
2024-04-15 18:37:43,761 INFO worker.py:1454 -- Calling ray.init() again after it has already been called.
2024-04-15 18:37:43,774 INFO worker.py:1454 -- Calling ray.init() again after it has already been called.
🔍 Running test: parallel_function
✅ Test completed: parallel_function. 🎯 Accuracy: 0.845
2024-04-15 18:37:43,813 INFO worker.py:1454 -- Calling ray.init() again after it has already been called.
🔍 Running test: simple
(raylet) /home/jobuser/build/openconnect-lib-core-image/environments/satellites/python/lib/python3.10/site-packages/ray/dashboard/modules/reporter/reporter_agent.py:56: UserWarning: gpustat package is not installed. GPU monitoring is not available. To have full functionality of the dashboard please install pip install ray[default].)
(raylet) warnings.warn(
✅ Test completed: simple. 🎯 Accuracy: 0.845
2024-04-15 18:37:43,854 INFO worker.py:1454 -- Calling ray.init() again after it has already been called.
🔍 Running test: relevance
✅ Test completed: relevance. 🎯 Accuracy: 0.6875
2024-04-15 18:37:43,876 INFO worker.py:1454 -- Calling ray.init() again after it has already been called.
🔍 Running test: multiple_function
✅ Test completed: multiple_function. 🎯 Accuracy: 0.935
```

  3. Leaderboard scores:

| Field | Value |
| --- | --- |
| Rank | 1 |
| Overall Acc | 80.48% |
| Model | Gorilla-OpenFunctions-v2 (FC) from HuggingFace |
| Model Link | https://huggingface.co/gorilla-llm/gorilla-openfunctions-v2 |
| Organization | Gorilla LLM |
| License | Apache 2.0 |
| AST Summary | 83.00% |
| Exec Summary | 0.00% |
| Simple Function AST | 84.50% |
| Python Simple Function AST | 84.50% |
| Java Simple Function AST | 0.00% |
| JavaScript Simple Function AST | 0.00% |
| Multiple Functions AST | 93.50% |
| Parallel Functions AST | 84.50% |
| Parallel Multiple AST | 69.50% |
| Simple Function Exec | 0.00% |
| Python Simple Function Exec | 0.00% |
| REST Simple Function Exec | 0.00% |
| Multiple Functions Exec | 0.00% |
| Parallel Functions Exec | 0.00% |
| Parallel Multiple Exec | 0.00% |
| Relevance Detection | 68.75% |
| Cost ($ Per 1k Function Calls) | N/A |
| Latency Mean (s) | N/A |
| Latency Standard Deviation (s) | N/A |
| Latency 95th Percentile (s) | N/A |

@Fanjia-Yan (Contributor) commented:
Hi Jason, at the outset, thank you so much for taking the time to review and modify our codebase. We appreciate your feedback and are actively working on verifying the results and merging the code.

Here is a list of actionable items we are going to take in the next few days:

  1. We have a script eval_data_compilation.py that compiles all the data pulled from HuggingFace into a single file for vLLM batch inference. We are open to this change, as it allows users to set up the inference pipeline easily using the HuggingFace models (a sketch of this compilation step appears after this list). Here is what we plan to do:
    • Remove all data from the ./data folder and host the 2000-line data file on HuggingFace.
    • Modify apply_function_credential_config.py so that it also applies credentials to the new data file.
  2. We welcome the idea of handler_runner.py: it is a simpler interface for locally deploying models to run on our evaluation. We would like to have a single reference point for model result generation, so that users can call openfunctions_evaluation.py alone to accomplish data generation. Here is what we can do:
    • Merge the handler_runner.py content into openfunctions_evaluation.py, replacing `python model_handler/handler_runner.py --data-path /home/jobuser/gorilla/berkeley-function-call-leaderboard/data/gorilla_openfunctions_v1_test_all.json --model-name gorilla-llm/gorilla-openfunctions-v2 --model-path {PATH_TO_MODEL}/gorilla-openfunctions-v2/` with `python model_handler/openfunctions_evaluation.py --model gorilla-llm/gorilla-openfunctions-v2`.
  3. In eval_runner_helper.py, the model_name → model_name_escaped change will introduce inconsistencies: MODEL_METADATA_MAPPING substitutes '_' with '/' in the model name and keeps a raw model name mapped to the one displayed on the website. We are going to revert that change before merging.
  4. Regarding your inference results, we will run a local evaluation based on your modifications and check whether our results match. We will also check whether the current handler matches what we have in our backend, and will respond soon with more information.
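
As a point of reference for item 1, a minimal sketch of the kind of compilation step eval_data_compilation.py performs; the file names, glob pattern, and JSON-lines layout are assumptions, not the script's actual code.

```python
import glob
import json

# Assumed layout: each per-category file is JSON lines; the merged file keeps
# one question per line so vLLM batch inference can read it directly.
merged = []
for path in sorted(glob.glob("data/gorilla_openfunctions_v1_test_*.json")):
    with open(path) as f:
        merged.extend(json.loads(line) for line in f)

with open("data/BFCL/questions.json", "w") as f:
    for row in merged:
        f.write(json.dumps(row) + "\n")
```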

Again, thank you very much for improving our code quality and spot-checking issues. We look forward to collaborating on this.

BFCL Team
