Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement MiniCPM-V-2 #637

Open
abrichr opened this issue Apr 25, 2024 · 2 comments
Open

Implement MiniCPM-V-2 #637

abrichr opened this issue Apr 25, 2024 · 2 comments
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed

Comments

@abrichr
Copy link
Contributor

abrichr commented Apr 25, 2024

Feature request

https://github.com/OpenBMB/MiniCPM is the smallest multimodal model available. The latest version, https://huggingface.co/openbmb/MiniCPM-V-2, appears to be able to understand GUI images:

Describe the given image with as much detail as possible.

image

The image features a calculator screen displaying the number 45. The interface includes various mathematical symbols and operators, such as plus (+), minus (-) signs with corresponding buttons for each operation: addition ("="), subtraction ('÷'), multiplication (*", division("/"),"and "%) percentage sign(s)"%"...
There are also indicators of different types in green colors like 'O', indicating an operator or function button on one side while there're numbers from zero to nine represented by numerical keys elsewhere along this section within its boundary which suggests that it is designed primarily for performing arithmetic calculations rather than any complex math operations beyond simple algebra rules due to absence visible functions related directly connected through keyboard shortcuts via icons similar those seen at top left corner

This model has been integrated into https://github.com/vllm-project/vllm, which is likely how we want to implement it here.

This involves creating an openadapt.adapters.vllm with parameterizable model name.

Motivation

Fully offline support.

@abrichr
Copy link
Contributor Author

abrichr commented May 29, 2024

Related: https://huggingface.co/spaces/opencompass/open_vlm_leaderboard

https://github.com/OpenGVLab/InternVL appears to surpass MiniCPM on the benchmark. However, testing suggests it does not perform as well on GUI elements. From https://internvl.opengvlab.com/ (same image as above):


user upload image
Describe the given image with as much detail as possible.

The image displays a calculator with a gray background and orange buttons. The calculator shows the number 45 in its display. The buttons are arranged in a typical layout with numbers 0 through 9, and function buttons such as addition, subtraction, multiplication, division, and percentage. The calculator also has buttons for clear, memory, and a decimal point. The top left corner of the calculator has three colored dots, which could indicate different modes or functions of the calculator.

Enumerate all interactable GUI elements

The interactable GUI elements in the image include:

The display showing the number "45".
The "C" button, which is likely for clearing the current input or calculation.
The "CE" button, which typically stands for "Clear Entry" and is used to clear the last input or part of the current input.
The "MC" button, which is likely for clearing the memory or the stored value.
The "M-" button, which is likely for recalling the last stored value from memory.
The "M+" button, which is likely for storing the current value in memory.
The "7", "8", "9" buttons for inputting numbers.
The "4", "5", "6" buttons for inputting numbers.
The "1", "2", "3" buttons for inputting numbers.
The "0" button for inputting numbers.
The "." button for inputting the decimal point.
The "+" button for addition.
The "-" button for subtraction.
The "x" button for multiplication.
The "/" button for division.
The "%" button for calculating the percentage.
The "=" button to execute the calculation and display the result.
These buttons are designed to be pressed by the user to perform calculations and input numbers on the calculator.

@abrichr
Copy link
Contributor Author

abrichr commented May 29, 2024

Also worth considering: https://huggingface.co/microsoft/Phi-3-vision-128k-instruct

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

1 participant