
Why, when running the same script (Qwen1.5/examples/web_demo.py) for Qwen1.5-32B-Chat-GPTQ-Int4 inference, is a 4090 24G about 5x faster to respond than a V100 32G? Is this a GPU performance issue, or is there some configuration in the code that isn't enabled, so the V100's compute power isn't being fully used? #4007

Open
lbl1120 opened this issue May 13, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@lbl1120

lbl1120 commented May 13, 2024

Why, when running the same script (Qwen1.5/examples/web_demo.py) for Qwen1.5-32B-Chat-GPTQ-Int4 inference, is a 4090 24G about 5x faster to respond than a V100 32G? Is this a GPU performance issue, or is there some configuration in the code that isn't enabled, so the V100's compute power isn't being fully used?

@lbl1120 lbl1120 added the bug Something isn't working label May 13, 2024

dosubot bot commented May 13, 2024

The significant difference in response speed you're observing between the 4090 24G and the V100 32G GPUs when running Qwen1.5/examples/web_demo.py with Qwen1.5-32B-Chat-GPTQ-Int4 is primarily due to the inherent performance differences between these two GPUs. This includes variations in CUDA cores, memory bandwidth, and architecture efficiency. The configuration file configs/model_config.py.example does not specify GPU-specific optimizations that would limit the computational power of either GPU, as the device setting is on "auto", allowing the system to select the best available device.
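As a quick sanity check (not part of the original reply), a minimal PyTorch sketch like the one below can confirm which GPU the "auto" device setting resolves to and what compute capability it reports; PyTorch is assumed to already be installed, since web_demo.py requires it. The V100 (Volta) reports compute capability 7.0, while the RTX 4090 (Ada) reports 8.9, and several of the faster attention/quantization kernels target capability 8.0 or newer.

```python
# Minimal sketch: list visible CUDA devices and their compute capability,
# to verify which GPU "auto" device selection will actually use.
import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        name = torch.cuda.get_device_name(i)
        major, minor = torch.cuda.get_device_capability(i)
        print(f"cuda:{i} {name} (compute capability {major}.{minor})")
else:
    print("No CUDA device visible; inference would fall back to CPU.")
```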

For optimizing GPU performance, especially on the V100, you might want to look into the configs/server_config.py.example for settings that could impact performance, including device configuration and GPU memory utilization. Additionally, exploring GPU optimization techniques recommended for large model inference could provide further improvements. These techniques include PagedAttention, Quantized KV Cache, FlashAttention, Dynamic Batching, and Algorithmic Enhancements, detailed in the large model inference optimization strategies document.
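For illustration only, here is a hedged sketch of loading the GPTQ model with Hugging Face Transformers and FlashAttention 2 enabled; this is not the exact loading path used by web_demo.py, and it assumes the flash-attn package plus GPTQ support (optimum/auto-gptq) are installed. FlashAttention 2 requires an Ampere-or-newer GPU (compute capability >= 8.0), so it can speed up the 4090 but is not available on the V100, which is one concrete example of an optimization that only one of the two cards can use.

```python
# Sketch: load Qwen1.5-32B-Chat-GPTQ-Int4 with FlashAttention 2 enabled.
# Assumes: transformers >= 4.36, flash-attn, and GPTQ support installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-32B-Chat-GPTQ-Int4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    # Only usable on compute capability >= 8.0 (e.g. 4090), not on V100.
    attn_implementation="flash_attention_2",
)
```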

Therefore, the observed performance discrepancy is less about a missed configuration and more about leveraging the specific capabilities and optimizations suitable for each GPU model.


@Zephyr69

When memory bandwidth isn't the bottleneck, the V100 is indeed expected to be far slower than the 4090.
