
[Android/Termux] Significantly higher RAM usage with Vulkan compared to CPU only #7351

Open
egeoz opened this issue May 17, 2024 · 3 comments


egeoz commented May 17, 2024

I have managed to get Vulkan working in the Termux environment on my Samsung Galaxy S24+ (Exynos 2400 and Xclipse 940), and I have been experimenting with LLMs in llama.cpp. While the performance improvement is excellent for both inference and prompt processing, I am seeing significantly higher RAM usage with Vulkan enabled, to the point where the device starts aggressively swapping out anything it can. The output is not garbled with Vulkan, so I do not think the issue is with my device's Vulkan drivers. Since my phone is not rooted, I am unable to see the memory usage of individual processes, but both instances were run with nothing in the background and right after one another.
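That said, /proc entries for Termux's own processes should still be readable without root, so something like the following might give a rough per-process number for the llama.cpp binary itself from a second session (an untested sketch; the pgrep pattern is an assumption and may need adjusting):

$ pgrep -f ./main                              # find the PID of the running main binary
$ grep -e VmRSS -e VmSwap /proc/<PID>/status   # resident vs. swapped-out size for that PID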

Vulkan

Run command:
$ ./main -m ../models/gemma-1.1-2b-it-Q6_K.gguf -ngl 50 -c 4096 --no-mmap -i

Memory:

$ free -h
               total        used        free      shared  buff/cache   available
Mem:            10Gi       9.9Gi       203Mi       3.0Mi       915Mi       894Mi
Swap:          8.0Gi       1.6Gi       6.4Gi

Benchmark with -n 100:

llama_print_timings:        load time =    9958.81 ms
llama_print_timings:      sample time =      51.08 ms /   100 runs   (    0.51 ms per token,  1957.64 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (     nan ms per token,      nan tokens per second)                                       
llama_print_timings:        eval time =    5877.33 ms /   100 runs   (   58.77 ms per token,    17.01 tokens per second)
llama_print_timings:       total time =    6266.68 ms /   100 tokens

CPU

Run command:
$ ./main -m ../models/gemma-1.1-2b-it-Q6_K.gguf -c 4096 --no-mmap -i

Memory:

$ free -h
               total        used        free      shared  buff/cache   available
Mem:            10Gi       6.0Gi       204Mi       8.0Mi       4.7Gi       4.7Gi
Swap:          8.0Gi       458Mi       7.6Gi

Benchmark with -n 100:

llama_print_timings:        load time =    1545.39 ms
llama_print_timings:      sample time =      14.47 ms /   100 runs   (    0.14 ms per token,  6912.76 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (     nan ms per token,      nan tokens per second)
llama_print_timings:        eval time =   12535.73 ms /   100 runs   (  125.36 ms per token,     7.98 tokens per second)
llama_print_timings:       total time =   12666.80 ms /   100 tokens
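For comparison, the eval numbers above work out to roughly a 2.1x speedup with Vulkan on this device (125.36 ms per token on CPU vs. 58.77 ms per token with Vulkan), so the backend itself is clearly working; the problem is only the memory footprint.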

Please let me know if I can provide any other information.

@smilingOrange
Can you show the steps you used to get llama.cpp with Vulkan working in Termux?

egeoz (Author) commented May 19, 2024

Can you show the steps you used to get llama.cpp with Vulkan working in Termux?

I've downloaded the latest artifact from the following link, installed mesa-zink from tur-repo, and enabled Zink with the GALLIUM_DRIVER=zink environment variable.
https://github.com/termux/termux-packages/actions?query=branch%3Adev%2Fsysvk++
Though I suspect it only worked properly for me because of the Xclipse GPU; I recall seeing some issues here regarding the Adreno Vulkan implementation.
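To spell that out, the full sequence was roughly the following (a sketch from memory; exact package names may differ, and the llama.cpp binary is the prebuilt CI artifact from the link above rather than anything installed via pkg):

$ pkg install tur-repo       # enable the Termux User Repository
$ pkg install mesa-zink      # Mesa build that ships the Zink driver
$ export GALLIUM_DRIVER=zink
$ ./main -m ../models/gemma-1.1-2b-it-Q6_K.gguf -ngl 50 -c 4096 --no-mmap -i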

Jeximo (Contributor) commented May 20, 2024

I recall seeing some issues here regarding the Adreno Vulkan implementation.

It's not implemented.

Related: #6395 (comment)
