
[Feature]: Support cogvlm-chat #1502

Merged · 45 commits · Jun 4, 2024

Conversation

RunningLeon
Collaborator

@RunningLeon RunningLeon commented Apr 26, 2024

Motivation

Support cogvlm-chat-hf and CogVLM2 for pytorch engine

Usage:

Warning

CogVLM-Chat-hf uses 'lmsys/vicuna-7b-v1.5' as its tokenizer, so you need to copy the tokenizer model and config files into the CogVLM model directory.
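A minimal sketch of one way to fetch those files, assuming `huggingface_hub` is installed (the `huggingface-cli` commands in the Prepare discussion further down are the documented route):

```python
# Illustrative only: copy the vicuna tokenizer files into the local CogVLM
# directory. File names match the huggingface-cli commands later in this thread.
from huggingface_hub import hf_hub_download

cogvlm_dir = './models--THUDM--cogvlm-chat-hf'  # adjust to your local model path
for filename in ('special_tokens_map.json', 'tokenizer.model', 'tokenizer_config.json'):
    hf_hub_download('lmsys/vicuna-7b-v1.5', filename, local_dir=cogvlm_dir)
```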

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

model_path = './models--THUDM--cogvlm-chat-hf'

pipe = pipeline(model_path)

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
```

Modification

TODOs

BC-breaking (Optional)

Profiling

tp=1 batch_size=128 num-prompts=3000

cogvlm-chat-hf

Without images
concurrency: 128
elapsed_time: 282.105s

first token latency(s)(min, max, ave): 0.859, 5.820, 1.471
per-token latency(s) percentile(50, 75, 95, 99): [0.032, 0.033, 0.111, 0.237]

number of prompt tokens: 758360
number of completion tokens: 725516
token throughput (completion token): 2571.792 token/s
token throughput (prompt + completion token): 5260.009 token/s
RPS (request per second): 10.634 req/s
RPM (request per minute): 638.060 req/min
with one image

changed profile_throughput.py to prepend 1234 tokens and image embeddings to each prompt; a rough sketch of the idea follows.
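The actual edit isn't shown here; a hypothetical sketch of the idea (names are illustrative, not the real profile_throughput.py code):

```python
# Hypothetical sketch, not the actual profile_throughput.py change: pad each
# prompt with placeholder tokens so prefill cost approximates one image's
# worth of embeddings.
NUM_IMAGE_TOKENS = 1234  # matches the number quoted above
PLACEHOLDER_ID = 0       # stand-in id; the real script would use the model's image token

def prepend_fake_image_tokens(input_ids):
    """Return the prompt token ids with NUM_IMAGE_TOKENS placeholders prepended."""
    return [PLACEHOLDER_ID] * NUM_IMAGE_TOKENS + list(input_ids)
```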

concurrency: 128
elapsed_time: 1066.815s

first token latency(s)(min, max, ave): 0.881, 39.275, 31.872
per-token latency(s) percentile(50, 75, 95, 99): [0.033, 0.036, 0.266, 0.334]

number of prompt tokens: 758360
number of completion tokens: 725516
token throughput (completion token): 680.077 token/s
token throughput (prompt + completion token): 1390.940 token/s
RPS (request per second): 2.812 req/s
RPM (request per minute): 168.726 req/min
REST API

using PR #1662

concurrency: 16
elapsed_time: 447.581s

first_token latency(min, max, ave): 0.055s, 7.081s, 1.278s

number of prompt tokens: 248339
number of completion tokens: 240582
token throughput (completion token): 537.516 token/s
token throughput (prompt + completion token): 1092.364 token/s
RPS (request per second): 2.234 req/s
RPM (request per minute): 134.054 req/min
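For reference, a rough client sketch for exercising the server benchmarked above, assuming the OpenAI-style `/v1/chat/completions` endpoint (the path shows up in the server logs later in this thread) and the vision message schema from the api_server_vl doc; the URL, port, and model name are placeholders to adjust for your deployment:

```python
# Rough sketch of a REST client request; endpoint path and message schema are
# assumptions based on the logs and docs referenced in this thread.
import requests

url = 'http://0.0.0.0:23333/v1/chat/completions'  # default api_server port; adjust as needed
payload = {
    'model': 'cogvlm-chat-hf',  # placeholder; use the model name your server reports
    'messages': [{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'describe this image'},
            {'type': 'image_url',
             'image_url': {'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'}},
        ],
    }],
}
print(requests.post(url, json=payload).json())
```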

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

@RunningLeon RunningLeon changed the title [WIP]: Support cogvlm [WIP]: Support cogvlm-chat May 10, 2024
@RunningLeon RunningLeon marked this pull request as ready for review May 15, 2024 12:24
@RunningLeon RunningLeon changed the title [WIP]: Support cogvlm-chat [Feature]: Support cogvlm-chat May 15, 2024
@RunningLeon RunningLeon removed the WIP label May 15, 2024
Install LMDeploy with pip (Python 3.8+). Refer to [Installation](https://lmdeploy.readthedocs.io/en/latest/get_started.html#installation) for more.

```shell
pip install lmdeploy
```
Collaborator

xformers should be installed

Collaborator

I installed xformers, but it installs torch 2.3.0, while lmdeploy requires torch<=2.2.2,>=2.0.0.

Collaborator
@lvhan028 lvhan028 Jun 3, 2024

We'd better guide users to install xformers before lmdeploy.

@lvhan028
Collaborator

@zhulinJulia24 please add cogvlm and cogvlm2 into test cases

lmdeploy/vl/model/cogvlm.py (review thread resolved, outdated)
@pseudotensor

pseudotensor commented May 31, 2024

Curious if you would know why cogvlm2 hits these kinds of issues with transformers. I don't have a repro, but I also noticed the same thing in sglang. I'm wondering if I'm misusing the model or something.
THUDM/CogVLM2#68
The server code is just a FastAPI wrapper (single thread at a time) around transformers:
https://github.com/h2oai/h2ogpt/blob/main/openai_server/cogvlm2_server/cogvlm2.py
This is just based upon their code:
https://github.com/THUDM/CogVLM2/blob/main/basic_demo/openai_api_demo.py
client side: https://github.com/THUDM/CogVLM2/blob/main/basic_demo/openai_api_request.py
Many things work, then it just hits issues and everything is dead.
It doesn't seem to be GPU OOM, because 40GB of the 80GB is still free.
I bring it up because I'd like to use lmdeploy for cogvlm2, but I'm worried something is not right.

@pseudotensor hi, maybe you could try lmdeploy with the latest code from this PR to see if it happens. You can refer to this doc: https://lmdeploy.readthedocs.io/en/latest/serving/api_server_vl.html#launch-service

Ok, trying now. First building docker image from this PR:

```shell
docker build . -f docker/Dockerfile -t cogvlm2 --no-cache
```

@pseudotensor

I modified the llava-like Dockerfile for this cogvlm2 case:

```dockerfile
# git clone https://github.com/InternLM/lmdeploy.git
# cd lmdeploy
# git fetch origin pull/1502/head:pr-1502
# git checkout pr-1502
# docker build . -f docker/Dockerfile -t cogvlm2
# cd ~/h2ogpt_ops
# docker build - < Dockerfile.cogvlm2 -t cogvlm2_internalvl

FROM cogvlm2:latest

RUN apt-get update && apt-get install -y python3 python3-pip git

WORKDIR /app

RUN pip3 install --upgrade pip
RUN pip3 install timm xformers triton==2.2.0
RUN pip3 install git+https://github.com/haotian-liu/LLaVA.git --no-deps

COPY . .

CMD ["lmdeploy", "serve", "api_server", "THUDM/cogvlm2-llama3-chat-19B"]
```

And notice on startup:

2024-06-01 00:39:16,053 - lmdeploy - WARNING - Fallback to pytorch engine because `/root/.cache/huggingface/hub/models--THUDM--cogvlm2-llama3-chat-19B/snapshots/2bf7de6892877eb50142395af14847519ba95998` not supported by turbomind engine.

is that ok?

@pseudotensor

pseudotensor commented Jun 1, 2024

A quite strange response that the transformers usage of the model doesn't produce. It seems like the prompting is off. This is with no image:

[screenshot]

Another funny one:

[screenshot]

Another bad one:

[screenshot]

It isn't always bad, but something seems off. I never noticed such oddities with the cogvlm2 demos locally.

But if I pass an image, it responds OK:

[screenshot]

@pseudotensor

pseudotensor commented Jun 1, 2024

I see this in the logs. Maybe something unintended is going on? It's OK as long as it's not doing CUDA in a fork.

```text
INFO:     172.16.0.225:19544 - "POST /v1/chat/completions HTTP/1.1" 200 OK
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
```

@RunningLeon
Collaborator Author

RunningLeon commented Jun 3, 2024

> I see this in the logs. Maybe something unintended is going on? It's OK as long as it's not doing CUDA in a fork. [...]

@pseudotensor hi, are you using tp>1? If so, you need to put your code inside an `if __name__ == '__main__':` guard, like this:

```python
from lmdeploy import pipeline, PytorchEngineConfig
from lmdeploy.vl import load_image

if __name__ == '__main__':
    pipe = pipeline('cogvlm-chat-hf', backend_config=PytorchEngineConfig(tp=2, max_prefill_token_num=4096, cache_max_entry_count=0.8))

    image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
    response = pipe(('describe this image', image))
    print(response)
```

@RunningLeon
Collaborator Author

RunningLeon commented Jun 3, 2024

> A quite strange response that the transformers usage of the model doesn't produce. It seems like the prompting is off. This is with no image: [...] It isn't always bad, but something seems off.

@pseudotensor hi, in cogvlm2's demo the prompt is wrapped with a text-only template for a session without image input, as in the demo code below. Can you try again with 7080da2?

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B"
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8 else torch.float16

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=TORCH_TYPE,
    trust_remote_code=True,
).to(DEVICE).eval()

text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"

while True:
    image_path = input("image path >>>>> ")
    if image_path == '':
        print('You did not enter image path, the following will be a plain text conversation.')
        image = None
        text_only_first_query = True
    else:
        image = Image.open(image_path).convert('RGB')

    history = []

    while True:
        query = input("Human:")
        if query == "clear":
            break

        if image is None:
            if text_only_first_query:
                query = text_only_template.format(query)
                text_only_first_query = False
            else:
                old_prompt = ''
                for _, (old_query, response) in enumerate(history):
                    old_prompt += old_query + " " + response + "\n"
                query = old_prompt + "USER: {} ASSISTANT:".format(query)
        if image is None:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                template_version='chat'
            )
        else:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                images=[image],
                template_version='chat'
            )
        inputs = {
            'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
            'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
            'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
            'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]] if image is not None else None,
        }
        gen_kwargs = {
            "max_new_tokens": 2048,
            "pad_token_id": 128002,
        }
        with torch.no_grad():
            outputs = model.generate(**inputs, **gen_kwargs)
            outputs = outputs[:, inputs['input_ids'].shape[1]:]
            response = tokenizer.decode(outputs[0])
            response = response.split("<|end_of_text|>")[0]
            print("\nCogVLM2:", response)
        history.append((query, response))
```


### Prepare

Download CogVLM models using huggingface-cli.
Collaborator

When deploying the CogVLM model using LMDeploy, it is necessary to download the model first, as the CogVLM model repository does not include the tokenizer model.

However, this step is not required for CogVLM2.

Taking one CogVLM model cogvlm-chat-hf as an example, you can prepare it as follows:

```shell
huggingface-cli download THUDM/cogvlm-chat-hf --local-dir ./cogvlm-chat-hf --local-dir-use-symlinks False
huggingface-cli download lmsys/vicuna-7b-v1.5 special_tokens_map.json tokenizer.model tokenizer_config.json --local-dir ./cogvlm-chat-hf --local-dir-use-symlinks False
```

```python
from lmdeploy import pipeline, PytorchEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline('cogvlm-chat-hf', backend_config=PytorchEngineConfig(tp=1, max_prefill_token_num=4096, cache_max_entry_count=0.8))
```
Collaborator

Do we have to set max_prefill_token_num?

Collaborator Author

Not necessary. Will remove later.

Note that xformers depends on torch, and you should select a version that won't reinstall torch. The following works for `torch==2.2.0`.

```shell
# for torch==2.2.0
```
Collaborator

In the openmmlab/lmdeploy docker images, the torch version is 2.1.0. Should we update it in the Dockerfile?

Collaborator Author

For torch 2.1.0, users can install xformers<0.0.23. As suggested in the docs, users should select a version that won't reinstall torch.
No need to update the Dockerfile to torch 2.2.0, since torch 2.1.0 with triton 2.1.0 is desired.
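A quick illustrative check that the xformers install did not replace the torch build lmdeploy expects (just printing the installed versions):

```python
# Sanity check: print the installed versions to confirm torch wasn't reinstalled
# by xformers (e.g. torch 2.1.0 should pair with xformers<0.0.23, per the note above).
import torch
import xformers

print(torch.__version__)
print(xformers.__version__)
```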

@pseudotensor

> @pseudotensor hi, are you using tp>1? If so, you need to put your code inside an `if __name__ == '__main__':` guard [...]

Hi, I'm using the docker image. I'm just reporting what is said in docker logs.

@RunningLeon
Collaborator Author

> Hi, I'm using the docker image. I'm just reporting what is said in docker logs.

@pseudotensor hi, this warning is from the huggingface tokenizers library. You can safely ignore it. If you want to suppress it, explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false).
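For example, setting it at the top of the entry script before the tokenizer is used:

```python
# Silence the huggingface/tokenizers fork warning; set before tokenizers is used.
import os
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
```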

@@ -122,6 +122,7 @@ For detailed inference benchmarks in more devices and more settings, please refe
<li>Mixtral (8x7B, 8x22B)</li>
<li>Gemma (2B - 7B)</li>
<li>Dbrx (132B)</li>
<li>StarCoder2 (3B - 15B)</li>
Collaborator

Duplicate StarCoder2 entry.

@lvhan028 lvhan028 merged commit aa8f7d1 into InternLM:main Jun 4, 2024
4 of 5 checks passed
@pseudotensor

congrats!
