
[Feature]: Support cogvlm-chat #1502

Merged · 45 commits · Jun 4, 2024

Conversation

RunningLeon
Collaborator

@RunningLeon RunningLeon commented Apr 26, 2024

Motivation

Support cogvlm-chat-hf and CogVLM2 for pytorch engine

Usage:

Warning

CogVLM-Chat-hf uses 'lmsys/vicuna-7b-v1.5' as its tokenizer, so you need to copy the tokenizer model and config files into the CogVLM model directory.
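A minimal sketch of one way to fetch those files, assuming `huggingface_hub` is installed (the `huggingface-cli` commands in the Prepare discussion further down are the documented route):

```python
# Illustrative only: copy the vicuna tokenizer files into the local CogVLM
# directory. File names match the huggingface-cli commands later in this thread.
from huggingface_hub import hf_hub_download

cogvlm_dir = './models--THUDM--cogvlm-chat-hf'  # adjust to your local model path
for filename in ('special_tokens_map.json', 'tokenizer.model', 'tokenizer_config.json'):
    hf_hub_download('lmsys/vicuna-7b-v1.5', filename, local_dir=cogvlm_dir)
```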

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

model_path = './models--THUDM--cogvlm-chat-hf'

pipe = pipeline(model_path)

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
```

Modification

TODOs

BC-breaking (Optional)

Profiling

tp=1 batch_size=128 num-prompts=3000

cogvlm-chat-hf

Without images
concurrency: 128
elapsed_time: 282.105s

first token latency(s)(min, max, ave): 0.859, 5.820, 1.471
per-token latency(s) percentile(50, 75, 95, 99): [0.032, 0.033, 0.111, 0.237]

number of prompt tokens: 758360
number of completion tokens: 725516
token throughput (completion token): 2571.792 token/s
token throughput (prompt + completion token): 5260.009 token/s
RPS (request per second): 10.634 req/s
RPM (request per minute): 638.060 req/min
with one image

changed profile_throughput.py to prepend 1234 tokens and image embeddings to each prompt; a rough sketch of the idea follows.
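The actual edit isn't shown here; a hypothetical sketch of the idea (names are illustrative, not the real profile_throughput.py code):

```python
# Hypothetical sketch, not the actual profile_throughput.py change: pad each
# prompt with placeholder tokens so prefill cost approximates one image's
# worth of embeddings.
NUM_IMAGE_TOKENS = 1234  # matches the number quoted above
PLACEHOLDER_ID = 0       # stand-in id; the real script would use the model's image token

def prepend_fake_image_tokens(input_ids):
    """Return the prompt token ids with NUM_IMAGE_TOKENS placeholders prepended."""
    return [PLACEHOLDER_ID] * NUM_IMAGE_TOKENS + list(input_ids)
```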

concurrency: 128
elapsed_time: 1066.815s

first token latency(s)(min, max, ave): 0.881, 39.275, 31.872
per-token latency(s) percentile(50, 75, 95, 99): [0.033, 0.036, 0.266, 0.334]

number of prompt tokens: 758360
number of completion tokens: 725516
token throughput (completion token): 680.077 token/s
token throughput (prompt + completion token): 1390.940 token/s
RPS (request per second): 2.812 req/s
RPM (request per minute): 168.726 req/min
REST API

using PR #1662

concurrency: 16
elapsed_time: 447.581s

first_token latency(min, max, ave): 0.055s, 7.081s, 1.278s

number of prompt tokens: 248339
number of completion tokens: 240582
token throughput (completion token): 537.516 token/s
token throughput (prompt + completion token): 1092.364 token/s
RPS (request per second): 2.234 req/s
RPM (request per minute): 134.054 req/min
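For reference, a rough client sketch for exercising the server benchmarked above, assuming the OpenAI-style `/v1/chat/completions` endpoint (the path shows up in the server logs later in this thread) and the vision message schema from the api_server_vl doc; the URL, port, and model name are placeholders to adjust for your deployment:

```python
# Rough sketch of a REST client request; endpoint path and message schema are
# assumptions based on the logs and docs referenced in this thread.
import requests

url = 'http://0.0.0.0:23333/v1/chat/completions'  # default api_server port; adjust as needed
payload = {
    'model': 'cogvlm-chat-hf',  # placeholder; use the model name your server reports
    'messages': [{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'describe this image'},
            {'type': 'image_url',
             'image_url': {'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'}},
        ],
    }],
}
print(requests.post(url, json=payload).json())
```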

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

@RunningLeon RunningLeon changed the title [WIP]: Support cogvlm [WIP]: Support cogvlm-chat May 10, 2024
@RunningLeon RunningLeon marked this pull request as ready for review May 15, 2024 12:24
@RunningLeon RunningLeon changed the title [WIP]: Support cogvlm-chat [Feature]: Support cogvlm-chat May 15, 2024
@RunningLeon RunningLeon removed the WIP label May 15, 2024
Install LMDeploy with pip (Python 3.8+). Refer to [Installation](https://lmdeploy.readthedocs.io/en/latest/get_started.html#installation) for more.

```shell
pip install lmdeploy
```
Collaborator

xformers should be installed

Collaborator

I installed xformers, but it installs torch 2.3.0, while lmdeploy requires torch<=2.2.2,>=2.0.0.

Collaborator
@lvhan028 lvhan028 Jun 3, 2024

We'd better guide users to install xformers before lmdeploy.

@lvhan028
Collaborator

@zhulinJulia24 please add cogvlm and cogvlm2 into test cases

lmdeploy/vl/model/cogvlm.py (review thread resolved, outdated)
@pseudotensor

pseudotensor commented May 31, 2024

Curious if you would know why cogvlm2 hits these kinds of issues with transformers. I don't have a repro, but I also noticed the same thing in sglang. I'm wondering if I'm misusing the model or something.
THUDM/CogVLM2#68
The server code is just a FastAPI wrapper (single thread at a time) around transformers:
https://github.com/h2oai/h2ogpt/blob/main/openai_server/cogvlm2_server/cogvlm2.py
This is just based upon their code:
https://github.com/THUDM/CogVLM2/blob/main/basic_demo/openai_api_demo.py
client side: https://github.com/THUDM/CogVLM2/blob/main/basic_demo/openai_api_request.py
Many things work, then it just hits issues and everything is dead.
It doesn't seem to be GPU OOM, because 40GB of the 80GB is still free.
I bring it up because I'd like to use lmdeploy for cogvlm2, but I'm worried something is not right.

@pseudotensor hi, maybe you could try lmdeploy with the latest code from this PR to see if it happens. You can refer to this doc: https://lmdeploy.readthedocs.io/en/latest/serving/api_server_vl.html#launch-service

Ok, trying now. First building docker image from this PR:

```shell
docker build . -f docker/Dockerfile -t cogvlm2 --no-cache
```

@pseudotensor

I modified the llava-like Dockerfile for this cogvlm2 case:

```dockerfile
# git clone https://github.com/InternLM/lmdeploy.git
# cd lmdeploy
# git fetch origin pull/1502/head:pr-1502
# git checkout pr-1502
# docker build . -f docker/Dockerfile -t cogvlm2
# cd ~/h2ogpt_ops
# docker build - < Dockerfile.cogvlm2 -t cogvlm2_internalvl

FROM cogvlm2:latest

RUN apt-get update && apt-get install -y python3 python3-pip git

WORKDIR /app

RUN pip3 install --upgrade pip
RUN pip3 install timm xformers triton==2.2.0
RUN pip3 install git+https://github.com/haotian-liu/LLaVA.git --no-deps

COPY . .

CMD ["lmdeploy", "serve", "api_server", "THUDM/cogvlm2-llama3-chat-19B"]
```

And notice on startup:

2024-06-01 00:39:16,053 - lmdeploy - WARNING - Fallback to pytorch engine because `/root/.cache/huggingface/hub/models--THUDM--cogvlm2-llama3-chat-19B/snapshots/2bf7de6892877eb50142395af14847519ba95998` not supported by turbomind engine.

is that ok?

@pseudotensor

pseudotensor commented Jun 1, 2024

A quite strange response that the transformers usage of the model doesn't produce. It seems like the prompting is off. This is with no image:

[screenshot]

Another funny one:

[screenshot]

Another bad one:

[screenshot]

It isn't always bad, but something seems off. I never noticed such oddities with the cogvlm2 demos locally.

But if I pass an image, it responds OK:

[screenshot]

@pseudotensor

pseudotensor commented Jun 1, 2024

I see this in the logs. Maybe something unintended is going on? It's OK as long as it's not doing CUDA in a fork.

```text
INFO:     172.16.0.225:19544 - "POST /v1/chat/completions HTTP/1.1" 200 OK
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
```

@RunningLeon
Collaborator Author

RunningLeon commented Jun 3, 2024

> I see this in the logs. Maybe something unintended is going on? It's OK as long as it's not doing CUDA in a fork. [...]

@pseudotensor hi, are you using tp>1? If so, you need to put your code inside an `if __name__ == '__main__':` guard, like this:

```python
from lmdeploy import pipeline, PytorchEngineConfig
from lmdeploy.vl import load_image

if __name__ == '__main__':
    pipe = pipeline('cogvlm-chat-hf', backend_config=PytorchEngineConfig(tp=2, max_prefill_token_num=4096, cache_max_entry_count=0.8))

    image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
    response = pipe(('describe this image', image))
    print(response)
```

@RunningLeon
Collaborator Author

RunningLeon commented Jun 3, 2024

> A quite strange response that the transformers usage of the model doesn't produce. It seems like the prompting is off. This is with no image: [...] It isn't always bad, but something seems off.

@pseudotensor hi, in cogvlm2's demo the prompt is wrapped with a text-only template for a session without image input, as in the demo code below. Can you try again with 7080da2?

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B"
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8 else torch.float16

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=TORCH_TYPE,
    trust_remote_code=True,
).to(DEVICE).eval()

text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"

while True:
    image_path = input("image path >>>>> ")
    if image_path == '':
        print('You did not enter image path, the following will be a plain text conversation.')
        image = None
        text_only_first_query = True
    else:
        image = Image.open(image_path).convert('RGB')

    history = []

    while True:
        query = input("Human:")
        if query == "clear":
            break

        if image is None:
            if text_only_first_query:
                query = text_only_template.format(query)
                text_only_first_query = False
            else:
                old_prompt = ''
                for _, (old_query, response) in enumerate(history):
                    old_prompt += old_query + " " + response + "\n"
                query = old_prompt + "USER: {} ASSISTANT:".format(query)
        if image is None:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                template_version='chat'
            )
        else:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                images=[image],
                template_version='chat'
            )
        inputs = {
            'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
            'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
            'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
            'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]] if image is not None else None,
        }
        gen_kwargs = {
            "max_new_tokens": 2048,
            "pad_token_id": 128002,
        }
        with torch.no_grad():
            outputs = model.generate(**inputs, **gen_kwargs)
            outputs = outputs[:, inputs['input_ids'].shape[1]:]
            response = tokenizer.decode(outputs[0])
            response = response.split("<|end_of_text|>")[0]
            print("\nCogVLM2:", response)
        history.append((query, response))
```


### Prepare

Download CogVLM models using huggingface-cli.
Collaborator

When deploying the CogVLM model using LMDeploy, it is necessary to download the model first, as the CogVLM model repository does not include the tokenizer model.

However, this step is not required for CogVLM2.

Taking one CogVLM model cogvlm-chat-hf as an example, you can prepare it as follows:

```shell
huggingface-cli download THUDM/cogvlm-chat-hf --local-dir ./cogvlm-chat-hf --local-dir-use-symlinks False
huggingface-cli download lmsys/vicuna-7b-v1.5 special_tokens_map.json tokenizer.model tokenizer_config.json --local-dir ./cogvlm-chat-hf --local-dir-use-symlinks False
```

```python
from lmdeploy import pipeline, PytorchEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline('cogvlm-chat-hf', backend_config=PytorchEngineConfig(tp=1, max_prefill_token_num=4096, cache_max_entry_count=0.8))
```
Collaborator

Do we have to set max_prefill_token_num?

Collaborator Author

Not necessary. Will remove later.

Note that xformers depends on torch, and you should select a version that won't reinstall torch. The following works for `torch==2.2.0`.

```shell
# for torch==2.2.0
```
Collaborator

In the openmmlab/lmdeploy docker images, the torch version is 2.1.0. Should we update it in the Dockerfile?

Collaborator Author

For torch 2.1.0, users can install xformers<0.0.23. As suggested in the docs, users should select a version that won't reinstall torch.
No need to update the Dockerfile to torch 2.2.0, since torch 2.1.0 with triton 2.1.0 is desired.
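A quick illustrative check that the xformers install did not replace the torch build lmdeploy expects (just printing the installed versions):

```python
# Sanity check: print the installed versions to confirm torch wasn't reinstalled
# by xformers (e.g. torch 2.1.0 should pair with xformers<0.0.23, per the note above).
import torch
import xformers

print(torch.__version__)
print(xformers.__version__)
```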

@pseudotensor

> @pseudotensor hi, are you using tp>1? If so, you need to put your code inside an `if __name__ == '__main__':` guard [...]

Hi, I'm using the docker image. I'm just reporting what is said in docker logs.

@RunningLeon
Collaborator Author

> Hi, I'm using the docker image. I'm just reporting what is said in docker logs.

@pseudotensor hi, this warning is from the huggingface tokenizers library. You can safely ignore it. If you want to suppress it, explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false).
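For example, setting it at the top of the entry script before the tokenizer is used:

```python
# Silence the huggingface/tokenizers fork warning; set before tokenizers is used.
import os
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
```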

@@ -122,6 +122,7 @@ For detailed inference benchmarks in more devices and more settings, please refe
<li>Mixtral (8x7B, 8x22B)</li>
<li>Gemma (2B - 7B)</li>
<li>Dbrx (132B)</li>
<li>StarCoder2 (3B - 15B)</li>
Collaborator

Duplicate StarCoder2 entry.

@lvhan028 lvhan028 merged commit aa8f7d1 into InternLM:main Jun 4, 2024
4 of 5 checks passed
@pseudotensor

congrats!
