
Why, when running the same script (Qwen1.5/examples/web_demo.py) for Qwen1.5-32B-Chat-GPTQ-Int4 inference, is a 4090 24G about 5x faster to respond than a V100 32G? Is this a GPU performance issue, or is there some configuration in the code that isn't enabled, so the V100's compute power isn't being fully used? #4007

Open
lbl1120 opened this issue May 13, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@lbl1120

lbl1120 commented May 13, 2024

Why, when running the same script (Qwen1.5/examples/web_demo.py) for Qwen1.5-32B-Chat-GPTQ-Int4 inference, is a 4090 24G about 5x faster to respond than a V100 32G? Is this a GPU performance issue, or is there some configuration in the code that isn't enabled, so the V100's compute power isn't being fully used?

@lbl1120 lbl1120 added the bug Something isn't working label May 13, 2024

dosubot bot commented May 13, 2024

The significant difference in response speed you're observing between the 4090 24G and the V100 32G GPUs when running Qwen1.5/examples/web_demo.py with Qwen1.5-32B-Chat-GPTQ-Int4 is primarily due to the inherent performance differences between these two GPUs. This includes variations in CUDA cores, memory bandwidth, and architecture efficiency. The configuration file configs/model_config.py.example does not specify GPU-specific optimizations that would limit the computational power of either GPU, as the device setting is on "auto", allowing the system to select the best available device.
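As a quick sanity check (not part of the original reply), a minimal PyTorch sketch like the one below can confirm which GPU the "auto" device setting resolves to and what compute capability it reports; PyTorch is assumed to already be installed, since web_demo.py requires it. The V100 (Volta) reports compute capability 7.0, while the RTX 4090 (Ada) reports 8.9, and several of the faster attention/quantization kernels target capability 8.0 or newer.

```python
# Minimal sketch: list visible CUDA devices and their compute capability,
# to verify which GPU "auto" device selection will actually use.
import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        name = torch.cuda.get_device_name(i)
        major, minor = torch.cuda.get_device_capability(i)
        print(f"cuda:{i} {name} (compute capability {major}.{minor})")
else:
    print("No CUDA device visible; inference would fall back to CPU.")
```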

For optimizing GPU performance, especially on the V100, you might want to look into the configs/server_config.py.example for settings that could impact performance, including device configuration and GPU memory utilization. Additionally, exploring GPU optimization techniques recommended for large model inference could provide further improvements. These techniques include PagedAttention, Quantized KV Cache, FlashAttention, Dynamic Batching, and Algorithmic Enhancements, detailed in the large model inference optimization strategies document.
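For illustration only, here is a hedged sketch of loading the GPTQ model with Hugging Face Transformers and FlashAttention 2 enabled; this is not the exact loading path used by web_demo.py, and it assumes the flash-attn package plus GPTQ support (optimum/auto-gptq) are installed. FlashAttention 2 requires an Ampere-or-newer GPU (compute capability >= 8.0), so it can speed up the 4090 but is not available on the V100, which is one concrete example of an optimization that only one of the two cards can use.

```python
# Sketch: load Qwen1.5-32B-Chat-GPTQ-Int4 with FlashAttention 2 enabled.
# Assumes: transformers >= 4.36, flash-attn, and GPTQ support installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-32B-Chat-GPTQ-Int4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    # Only usable on compute capability >= 8.0 (e.g. 4090), not on V100.
    attn_implementation="flash_attention_2",
)
```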

Therefore, the observed performance discrepancy is less about a missed configuration and more about leveraging the specific capabilities and optimizations suitable for each GPU model.


@Zephyr69

When memory bandwidth isn't the bottleneck, the V100 is indeed expected to be far slower than the 4090.
