Hi, how do I test tensorrt-llm serving correctly? I've tested llama2-8b-chat and llama3-8b, and in both cases the TTFT is far too high. Could you tell me what is going wrong? Thanks.
Proposal to improve performance
I use the docker image nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3 and follow this doc: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md. My results were measured at a request rate of 7.
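For context on what "testing correctly" usually means here: TTFT is measured with a streaming request, timing from the moment the request is sent to the arrival of the first streamed chunk. Below is a minimal sketch of such a probe, assuming the tensorrtllm_backend `ensemble` model from the llama.md walkthrough is served on localhost:8000 and that Triton's HTTP generate extension is enabled; the endpoint path and field names (`text_input`, `max_tokens`, `stream`) are assumptions about the default setup, not taken from this issue.

```python
# Minimal TTFT probe for a Triton + tensorrtllm_backend deployment.
# Assumptions (adjust to your setup): the "ensemble" model from the
# llama.md walkthrough is served on localhost:8000, and the input field
# names below match the default tensorrtllm_backend ensemble config.
import time

import requests

URL = "http://localhost:8000/v2/models/ensemble/generate_stream"
payload = {
    "text_input": "What is machine learning?",
    "max_tokens": 64,
    "stream": True,
}

start = time.perf_counter()
ttft = None
n_chunks = 0
# generate_stream returns Server-Sent Events; the first non-empty line
# marks the arrival of the first output token.
with requests.post(URL, json=payload, stream=True, timeout=120) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue  # skip SSE blank-line separators
        if ttft is None:
            ttft = time.perf_counter() - start
        n_chunks += 1

total = time.perf_counter() - start
print(f"TTFT: {ttft * 1000:.1f} ms, total: {total * 1000:.1f} ms, chunks: {n_chunks}")
```

Note that a single unloaded request like this gives a lower bound on TTFT; at a sustained request rate such as 7 req/s, queuing delay dominates, so reproducing the reported numbers would normally use an open-loop load generator (for example vLLM's benchmarks/benchmark_serving.py with `--request-rate 7`) rather than sequential requests.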
Related issue: triton-inference-server/tensorrtllm_backend#453
Report of performance regression
ran script:
Misc discussion on performance
No response
Your current environment (if you think it is necessary)