
[Feature] Let AsyncEngine's stream_infer accept a manually supplied session_id so that multiple concurrent stream_infer calls run inference in parallel #1590

Open
NagatoYuki0943 opened this issue May 14, 2024 · 2 comments

Comments

@NagatoYuki0943

NagatoYuki0943 commented May 14, 2024

Motivation

I instantiate the pipeline myself and call pipe.stream_infer to run inference; the code is as follows:

response = ""
for _response in self.pipe.stream_infer(
    prompts=prompts,
    gen_config=self.gen_config,
    do_preprocess=True,
    adapter_name=None
):
    # accumulate the streamed text and yield the growing reply so the
    # Gradio chat UI can update token by token
    response += _response.text
    yield response, history + [[query, response]]

I then built a Gradio UI on top of this for inference. Gradio handles requests on multiple threads, and with the transformers library several users could chat with the model at the same time; with lmdeploy, however, the conversations were serialized and only one user could chat with the model at any given moment.

After some digging I found that AsyncEngine's stream_infer passes the index of the prompt list as the session_id when it calls generate, so when several requests hit this function at the same time, every conversation gets session_id 0 and inference falls back to running serially.
This line passes the fixed list index: https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/serve/async_engine.py#L495

Changing the i that gets passed in to a random number makes parallel inference work, but that fix does not feel very elegant; I would like a way to pass the session_id in manually.

        # this patch also needs `import random` at the top of async_engine.py
        for i, prompt in enumerate(prompts):
            generators.append(
                self.generate(prompt,
                              random.randint(0, 10**9),  # originally `i`; replaced with a random session id
                              gen_config=gen_config[i],
                              stream_response=True,
                              sequence_start=True,
                              sequence_end=True,
                              do_preprocess=do_preprocess,
                              adapter_name=adapter_name,
                              **kwargs))
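An alternative to drawing a random number is a process-wide counter, which guarantees that concurrent calls never reuse a session_id. This is only an illustrative sketch in plain Python, independent of lmdeploy; the helper name next_session_id is made up:

import itertools

# one counter shared by the whole process; each stream_infer call draws the
# next value, so no two concurrent conversations share a session_id
_session_ids = itertools.count(1)

def next_session_id() -> int:
    return next(_session_ids)

The call to random.randint in the loop above could then be replaced with next_session_id().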

Related resources

Here is the full code of my inference setup:
https://github.com/NagatoYuki0943/xlab-huanhuan/blob/master/lmdeploy/turbomind_gradio_stream.py
https://github.com/NagatoYuki0943/xlab-huanhuan/blob/master/lmdeploy/infer_engine.py

Additional context

No response

@irexyc
Collaborator

irexyc commented May 14, 2024

I can think of two relatively simple options.

1. Instead of wrapping pipeline.stream_infer, wrap the lower-level instance's stream_infer, i.e. pipeline.engine.create_instance().stream_infer. For TurboMind the corresponding implementation is here.
2. Wrap pipeline.generate. Calling the coroutine from multiple threads while also streaming the output may require wrapping it in a thread (see the sketch below); there may be a better way. If streaming output is not needed, pipeline.chat can be used directly.
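A minimal sketch of the second approach, assuming the generate signature shown in the snippet above (prompt, session_id, gen_config, stream_response, sequence_start, sequence_end). The attribute out.response and the helper name stream_generate are illustrative assumptions, not lmdeploy's documented API:

import asyncio
import threading
from queue import Queue

# a single event loop running in a background thread; Gradio's worker threads
# submit coroutines to it instead of each creating their own loop
_loop = asyncio.new_event_loop()
threading.Thread(target=_loop.run_forever, daemon=True).start()

def stream_generate(pipe, prompt, session_id, gen_config=None):
    """Iterate over pipe.generate() synchronously from any thread.

    session_id must be unique per concurrent conversation, e.g. taken from a
    process-wide counter like the one sketched earlier in this thread.
    """
    queue, sentinel = Queue(), object()

    async def _drain():
        async for out in pipe.generate(prompt,
                                       session_id,
                                       gen_config=gen_config,
                                       stream_response=True,
                                       sequence_start=True,
                                       sequence_end=True):
            # the field holding the text is an assumption; check what generate
            # actually yields in your lmdeploy version
            queue.put(out.response)
        queue.put(sentinel)

    asyncio.run_coroutine_threadsafe(_drain(), _loop)
    while (item := queue.get()) is not sentinel:
        yield item

Each Gradio request then calls stream_generate with its own session_id and streams text as it arrives, while other requests run in parallel on the same event loop.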

@NagatoYuki0943
Author

OK, thank you!
