
[Performance]: Why is the avg. generation throughput low? #4760

Open
rvsh2 opened this issue May 11, 2024 · 0 comments
Labels
performance Performance-related issues

Comments


rvsh2 commented May 11, 2024

Report of performance regression

Hi, I use this command:

server_vllm.py \
  --model "/data/models_temp/functionary-small-v2.4/" \
  --served-model-name "functionary" \
  --dtype=bfloat16 \
  --max-model-len 2048 \
  --host 0.0.0.0 \
  --port 8000 \
  --enforce-eager \
  --gpu-memory-utilization 0.94

on an RTX 3090 (24 GB).
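(A note on the flags: --enforce-eager disables CUDA-graph capture, which typically costs some decode throughput. A sketch of the same launch with only that flag removed, everything else unchanged:)

server_vllm.py \
  --model "/data/models_temp/functionary-small-v2.4/" \
  --served-model-name "functionary" \
  --dtype=bfloat16 \
  --max-model-len 2048 \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.94   # --enforce-eager dropped so CUDA graphs can be used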

Why am I getting such low speed?
Avg prompt throughput: 102.2 tokens/s, Avg generation throughput: 2.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.8%, CPU KV cache usage: 0.0%
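
To cross-check the server-side numbers from the client, a rough timing of one non-streaming request (a sketch, assuming the server exposes the usual OpenAI-compatible /v1/chat/completions route and returns a usage block):

time curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "functionary", "messages": [{"role": "user", "content": "Write one short paragraph about GPUs."}], "max_tokens": 200}'
# completion_tokens from the returned "usage" field, divided by the wall-clock time,
# gives an end-to-end generation tokens/s figure to compare with the log above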

This is my config:

| INFO 05-11 08:17:48 server_vllm.py:473] args: Namespace(host='0.0.0.0', port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name='functionary', grammar_sampling=False, model='/data/models_temp/functionary-small-v2.4/', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='bfloat16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=2048, guided_decoding_backend='outlines', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.94, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=True, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, model_loader_extra_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
functionary  | You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
functionary  | Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
functionary  | INFO 05-11 08:17:49 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='/data/models_temp/functionary-small-v2.4/', speculative_config=None, tokenizer='/data/models_temp/functionary-small-v2.4/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
functionary  | Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
functionary  | INFO 05-11 08:17:50 utils.py:608] Found nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
functionary  | INFO 05-11 08:17:50 selector.py:28] Using FlashAttention backend.
functionary  | INFO 05-11 08:17:53 model_runner.py:173] Loading model weights took 13.4976 GB
functionary  | INFO 05-11 08:17:53 gpu_executor.py:119] # GPU blocks: 4185, # CPU blocks: 2048
functionary  | INFO:     Started server process [19]
functionary  | INFO:     Waiting for application startup.
functionary  | INFO:     Application startup complete.
functionary  | INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
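
For scale, a rough calculation from the log above (using block_size=16 from the args): 4185 GPU blocks × 16 tokens/block ≈ 66,960 KV-cache token slots, so the 0.8% KV cache usage in the throughput line is only about 500 tokens in flight, i.e. a single short request rather than a saturated batch.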
rvsh2 added the performance label May 11, 2024