Hi, thank you for developing llamafile; it's such a wonderful tool.
For some time now, llama.cpp on Linux has had support for unified memory architecture (UMA, for AMD APUs) to share main memory between the CPU and the integrated GPU. This requires compiling llama.cpp with the -DLLAMA_HIP_UMA=on setting.
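As I understand it (this is just my mental model of what the flag buys you, not the actual llama.cpp code), the UMA build boils down to allocating GPU buffers with hipMallocManaged() instead of hipMalloc(), so on an APU they can be backed by ordinary system RAM instead of the small VRAM carve-out. A minimal standalone sketch of that difference:

```cpp
// Standalone sketch of the UMA idea, not llama.cpp/llamafile code.
// Build with: hipcc uma_sketch.cpp -o uma_sketch
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    const size_t size = 6ull << 30;  // ~6 GiB, far larger than a 512 MB VRAM carve-out
    void *p = nullptr;

    // Plain device allocation: limited by the VRAM the BIOS carves out for the iGPU.
    if (hipMalloc(&p, size) != hipSuccess) {
        printf("hipMalloc failed: out of (dedicated) memory\n");
    } else {
        hipFree(p);
    }

    // Managed allocation: on a UMA APU it can be backed by ordinary system RAM.
    if (hipMallocManaged(&p, size) == hipSuccess) {
        printf("hipMallocManaged succeeded\n");
        hipFree(p);
    }
    return 0;
}
```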
I'm trying to compile llamafile with this additional setting for the bundled llama.cpp, but I'm having some problems. Could you point me in the right direction?
I'm using Ubuntu 22.04 with an AMD 5600G APU and ROCm 6.1. When I compile llama.cpp I use the make LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx900 command. Then I can load any model that fits into my RAM, even though my system reports only 512 MB of VRAM. I also have the HSA_OVERRIDE_GFX_VERSION=9.0.0 and HSA_ENABLE_SDMA=0 environment variables set in my .profile file.
I tried adding -DLLAMA_HIPBLAS=on -DLLAMA_HIP_UMA=on -DAMDGPU_TARGETS=gfx900 in two places: in the llamafile-0.8/llamafile/cuda.c file, just under "-DGGML_USE_HIPBLAS", inside the static bool compile_amd_unix(const char *dso, const char *src, const char *tmpdso) function, and in the llamafile-0.8/llamafile/rocm.sh file, just under the -DGGML_USE_HIPBLAS \ line (roughly as sketched below).
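To show the kind of edit I mean (a sketch from memory, not an exact copy of either file; the hipcc invocation in the log further down shows which flags actually ended up on the compile line):

```c
/* Hypothetical sketch of the edit in llamafile-0.8/llamafile/cuda.c
 * (compile_amd_unix): the three extra -D flags were inserted right after
 * the existing "-DGGML_USE_HIPBLAS" entry in the hipcc argument list.
 * The array name here is made up for illustration only. */
static const char *hipcc_flags[] = {
    /* ...existing flags... */
    "-DGGML_USE_HIPBLAS",
    "-DLLAMA_HIPBLAS=on",      /* added */
    "-DLLAMA_HIP_UMA=on",      /* added */
    "-DAMDGPU_TARGETS=gfx900", /* added */
    /* ...existing flags continue... */
};
```

The same three flags were added as extra continuation lines in rocm.sh.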
Then I compiled llamafile:
cd llamafile-0.8
make
sudo make install PREFIX=/usr/local
Unfortunately, when I launched the same model that I use with llama.cpp, the ggml_backend_cuda_buffer_type_alloc_buffer: allocating 5169.86 MiB on device 0: cudaMalloc failed: out of memory error occurred:
$ llamafile -m Shared/models/TheBloke/CodeLlama-7B-Instruct-GGUF/codellama-7b-instruct.Q6_K.gguf -ngl 9999 --port 8080 --host 0.0.0.0
import_cuda_impl: initializing gpu module...
extracting /zip/llama.cpp/ggml.h to /home/ubuntu/.llamafile/ggml.h
extracting /zip/llamafile/compcap.cu to /home/ubuntu/.llamafile/compcap.cu
extracting /zip/llamafile/llamafile.h to /home/ubuntu/.llamafile/llamafile.h
extracting /zip/llamafile/tinyblas.h to /home/ubuntu/.llamafile/tinyblas.h
extracting /zip/llamafile/tinyblas.cu to /home/ubuntu/.llamafile/tinyblas.cu
extracting /zip/llama.cpp/ggml-impl.h to /home/ubuntu/.llamafile/ggml-impl.h
extracting /zip/llama.cpp/ggml-cuda.h to /home/ubuntu/.llamafile/ggml-cuda.h
extracting /zip/llama.cpp/ggml-alloc.h to /home/ubuntu/.llamafile/ggml-alloc.h
extracting /zip/llama.cpp/ggml-common.h to /home/ubuntu/.llamafile/ggml-common.h
extracting /zip/llama.cpp/ggml-backend.h to /home/ubuntu/.llamafile/ggml-backend.h
extracting /zip/llama.cpp/ggml-backend-impl.h to /home/ubuntu/.llamafile/ggml-backend-impl.h
extracting /zip/llama.cpp/ggml-cuda.cu to /home/ubuntu/.llamafile/ggml-cuda.cu
get_rocm_bin_path: note: hipInfo not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/hipInfo does not exist
get_rocm_bin_path: note: /opt/rocm/bin/hipInfo does not exist
llamafile_log_command: /usr/bin/rocminfo
llamafile_log_command: hipcc -O3 -fPIC -shared -DNDEBUG --offload-arch=gfx900 -march=native -mtune=native -DGGML_BUILD=1 -DGGML_SHARED=1 -Wno-return-type -Wno-unused-result -DGGML_USE_HIPBLAS -DLLAMA_HIPBLAS=on -DLLAMA_HIP_UMA=on -DAMDGPU_TARGETS=gfx900 -DGGML_CUDA_MMV_Y=1 -DGGML_MULTIPLATFORM -DGGML_CUDA_DMMV_X=32 -DIGNORE4 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DIGNORE -o /home/ubuntu/.llamafile/ggml-rocm.so.na8zyf /home/ubuntu/.llamafile/ggml-cuda.cu -lhipblas -lrocblas
/home/ubuntu/.llamafile/ggml-cuda.cu:405:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
405 | }
| ^
/home/ubuntu/.llamafile/ggml-cuda.cu:9182:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
9182 | }
| ^
/home/ubuntu/.llamafile/ggml-cuda.cu:8597:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
8597 | static __global__ void soft_max_f32(const float * x, const float * mask, const float * pos, float * dst, const int ncols_par, const int nrows_y, const float scale, const float max_bias, const float m0, const float m1, uint32_t n_head_log2) {
| ^
/home/ubuntu/.llamafile/ggml-cuda.cu:8597:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/ubuntu/.llamafile/ggml-cuda.cu:8597:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/ubuntu/.llamafile/ggml-cuda.cu:8597:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/ubuntu/.llamafile/ggml-cuda.cu:8597:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/ubuntu/.llamafile/ggml-cuda.cu:8597:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
8 warnings generated when compiling for gfx900.
/home/ubuntu/.llamafile/ggml-cuda.cu:405:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
405 | }
| ^
/home/ubuntu/.llamafile/ggml-cuda.cu:9182:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
9182 | }
| ^
2 warnings generated when compiling for host.
link_cuda_dso: note: dynamically linking /home/ubuntu/.llamafile/ggml-rocm.so
ggml_cuda_link: welcome to ROCm SDK with hipBLAS
link_cuda_dso: GPU support loaded
{"build":1500,"commit":"a30b324","function":"server_cli","level":"INFO","line":2859,"msg":"build info","tid":"9434528","timestamp":1714076061}
{"function":"server_cli","level":"INFO","line":2862,"msg":"system info","n_threads":6,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"9434528","timestamp":1714076061,"total_threads":12}
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from Shared/models/TheBloke/CodeLlama-7B-Instruct-GGUF/codellama-7b-instruct.Q6_K.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = codellama_codellama-7b-instruct-hf
llama_model_loader: - kv 2: llama.context_length u32 = 16384
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 18
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32016] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32016] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32016] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q6_K: 226 tensors
llm_load_vocab: mismatch in special tokens definition ( 264/32016 vs 259/32016 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32016
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 16384
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 16384
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q6_K
llm_load_print_meta: model params = 6.74 B
llm_load_print_meta: model size = 5.15 GiB (6.56 BPW)
llm_load_print_meta: general.name = codellama_codellama-7b-instruct-hf
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, compute capability 9.0, VMM: no
llm_load_tensors: ggml ctx size = 0.30 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 5169.86 MiB on device 0: cudaMalloc failed: out of memory
llama_model_load: error loading model: unable to allocate backend buffer
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'Shared/models/TheBloke/CodeLlama-7B-Instruct-GGUF/codellama-7b-instruct.Q6_K.gguf'
{"function":"load_model","level":"ERR","line":447,"model":"Shared/models/TheBloke/CodeLlama-7B-Instruct-GGUF/codellama-7b-instruct.Q6_K.gguf","msg":"unable to load model","tid":"9434528","timestamp":1714076061}
When I run llama.cpp compiled with UMA support, everything works fine. Should I add -DLLAMA_HIPBLAS=on -DLLAMA_HIP_UMA=on -DAMDGPU_TARGETS=gfx900 for llama.cpp somewhere else? I would be very grateful for any help.