# Token generation performance troubleshooting

## Verifying that the model is running on the Nvidia GPU with cuBLAS

Make sure to set `-DLLAMA_CUBLAS=ON` when configuring CMake as described in the README, and purge the previous build directory before reconfiguring and recompiling.
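For example, with a typical out-of-tree CMake build on Linux (a minimal sketch; adjust the build directory and generator to your setup):

```
# Remove any stale build that was configured without cuBLAS,
# then reconfigure with cuBLAS enabled and rebuild
rm -rf build
cmake -S . -B build -DLLAMA_CUBLAS=ON
cmake --build build --config Release
```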

When PowerInfer utilizes the GPU, it will output diagnostic information that shows whether cuBLAS is offloading work to the GPU. Look for these lines:

```
llm_load_sparse_model_tensors: using CUDA for GPU acceleration
llm_load_sparse_model_tensors: mem required  = 16825.94 MB
llm_load_sparse_model_tensors: VRAM used: 10183.80 MB
```

If you see these lines, then the GPU is being used and model tensors are being loaded into VRAM.

## Verifying that FFN split is working

Ideally, PowerInfer should be able to utilize all available GPU memory, or the VRAM budget you set. It first tries to offload dense layers (attention, predictors, etc.) to VRAM, and then tries to split hot neurons of the FFN into VRAM if there is still space left.
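To cap how much VRAM PowerInfer may use, you can pass a budget at launch. A hedged example, assuming your build exposes the `--vram-budget` flag (value in GiB); run `./build/bin/main --help` to confirm the exact flag on your build:

```
# Limit PowerInfer to roughly 8 GiB of VRAM for dense layers plus FFN split
./build/bin/main -m path/to/model -n 128 -p "Once upon a time" --vram-budget 8
```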

You can look at this line to see how much of the FFN has been split and offloaded:

```
llm_load_gpu_split: offloaded 12577.50 MiB of FFN weights to GPU
```

If you find that the VRAM usage is much lower than expected, then the FFN split is likely not working. Splitting the FFN requires solving the neuron placement via the `powerinfer` Python module and then loading the generated GPU index file, as shown in the following lines.

Solving (the result is cached, so this only happens once):

```
invoking powerinfer Python module to generate gpu split for 12738.39 MiB of VRAM
solver args: Namespace(activation='/nvme2/huggingface/PowerInfer/ReluLLaMA-13B-PowerInfer-GGUF/activation', neuron=13824, capacity=429432, layer=40, vram_capacity=13357166592, batch=256, threshold=0, output='/nvme2/huggingface/PowerInfer/ReluLLaMA-13B-PowerInfer-GGUF/llama-13b-relu.powerinfer.gguf.generated.gpuidx')
...
exported GPU index to /nvme2/huggingface/PowerInfer/ReluLLaMA-13B-PowerInfer-GGUF/llama-13b-relu.powerinfer.gguf.generated.gpuidx
```

Loading generated or cached GPU index:

```
llama_model_loader: loaded meta data with 3 key-value pairs and 80 tensors from /nvme2/huggingface/PowerInfer/ReluLLaMA-13B-PowerInfer-GGUF/llama-13b-relu.powerinfer.gguf.generated.gpuidx (version GGUF V3 (latest))
llama_model_loader: - tensor    0:                    blk.0.gpu_idx i32      [ 13824,     1,     1,     1 ]
...
apply_tensors_to_base_model: applying gpu_idx adapter from '/nvme2/huggingface/PowerInfer/ReluLLaMA-13B-PowerInfer-GGUF/llama-13b-relu.powerinfer.gguf.generated.gpuidx' - please wait ...
```

If you don't see any of these lines, then the FFN split is not working. Possible causes:

- The `powerinfer` Python module is not installed, or its virtual environment is not activated if you are using one
- There is no `activation` directory in the model directory; it holds the activation files needed to solve the FFN split

Please refer to Setup and Installation for more information on runtime dependencies, and to Model Weights for more information on model weights.
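As a quick sanity check for both causes above, you can run something like the following from the shell (a sketch; it assumes the solver module is importable as `powerinfer` and uses an example model path):

```
# Is the powerinfer solver module importable in the current (virtual) environment?
python -c "import powerinfer; print('powerinfer module OK')"

# Does the model directory contain the activation files used by the solver?
ls path/to/ReluLLaMA-13B-PowerInfer-GGUF/activation
```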

## Example of how runtime flags affect inference speed

Please refer to Evaluation for more information on the token generation benchmark on Linux.

For Windows, we have tested PowerInfer/ReluLLaMA-13B-PowerInfer-GGUF on a machine with the following specs:

- GPU: Nvidia RTX 2080Ti (11GB VRAM)
- CPU: Intel i7-13700
- RAM: 64GB DDR4, 3200MHz

Run command:

```
.\build\bin\Release\main.exe -m path\to\model -n 64 -p "Once upon a time" [additional benchmark flags]
```

Result:

| command | tokens/second (higher is better) |
| --- | --- |
| [no additional flags] | 4.05 |
| `-t 8` | 4.27 |

## CPU affinity and hybrid architecture

PowerInfer achieves the best performance on hybrid-architecture (big.LITTLE) CPUs when it runs on all of the CPU's performance cores (P cores). On such hardware, we recommend setting `-t`/`--threads` to the number of available P cores.

We found that Windows is sometimes unable to schedule threads on the P cores, which hurts generation performance. If you find the token generation speed unstable, or the utilization of the P cores low, you can try setting CPU affinity manually with `Start-Process` in PowerShell, as in this example on a 12th Gen Core i7 (8 P cores):

```
Start-Process -FilePath path\to\main.exe -ArgumentList "-m", "path\to\model", "-t", "8", "-n", "128", "-p", "`"Once upon a time`"" -NoNewWindow -PassThru -Wait | ForEach-Object { $_.ProcessorAffinity = 0x5555 }
```

It works like `taskset` on Linux and sets the CPU affinity to the P cores only (0x5555 is a bit mask selecting CPUs 0, 2, 4, 6, 8, 10, 12, and 14). Please refer to the `Start-Process` documentation for more details.
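If your CPU has a different number of P cores, you can compute the mask yourself and apply it right after launch. A minimal PowerShell sketch, assuming 8 P cores whose first hardware threads are the even-numbered logical CPUs (adjust the range for your core count):

```
# Set one bit per P core: logical CPUs 0, 2, 4, ..., 14 -> 0x5555
$mask = 0
0..7 | ForEach-Object { $mask = $mask -bor (1 -shl (2 * $_)) }

# Launch without -Wait so the affinity takes effect while the process is running,
# then wait for it to finish
$proc = Start-Process -FilePath path\to\main.exe -ArgumentList "-m", "path\to\model", "-t", "8", "-n", "128", "-p", "`"Once upon a time`"" -NoNewWindow -PassThru
$proc.ProcessorAffinity = $mask
$proc.WaitForExit()
```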