Hi, thank you for developing llamafile; it's such a wonderful tool.
For some time now, llama.cpp on Linux has had support for unified memory architecture (UMA, for AMD APUs) to share main memory between the CPU and the integrated GPU. This requires compiling llama.cpp with the -DLLAMA_HIP_UMA=on setting.
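As I understand it (this is just my mental model of what the flag buys you, not the actual llama.cpp code), the UMA build boils down to allocating GPU buffers with hipMallocManaged() instead of hipMalloc(), so on an APU they can be backed by ordinary system RAM instead of the small VRAM carve-out. A minimal standalone sketch of that difference:

```cpp
// Standalone sketch of the UMA idea, not llama.cpp/llamafile code.
// Build with: hipcc uma_sketch.cpp -o uma_sketch
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    const size_t size = 6ull << 30;  // ~6 GiB, far larger than a 512 MB VRAM carve-out
    void *p = nullptr;

    // Plain device allocation: limited by the VRAM the BIOS carves out for the iGPU.
    if (hipMalloc(&p, size) != hipSuccess) {
        printf("hipMalloc failed: out of (dedicated) memory\n");
    } else {
        hipFree(p);
    }

    // Managed allocation: on a UMA APU it can be backed by ordinary system RAM.
    if (hipMallocManaged(&p, size) == hipSuccess) {
        printf("hipMallocManaged succeeded\n");
        hipFree(p);
    }
    return 0;
}
```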
I'm trying to compile llamafile with this additional setting for the bundled llama.cpp, but I'm having some problems. Could you point me in the right direction?
I'm using Ubuntu 22.04 with an AMD 5600G APU and ROCm 6.1. When I compile llama.cpp I use the make LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx900 command. Then I can load any model that fits into my RAM, even though my system reports only 512 MB of VRAM. I also have the HSA_OVERRIDE_GFX_VERSION=9.0.0 and HSA_ENABLE_SDMA=0 environment variables set in my .profile file.
I tried adding -DLLAMA_HIPBLAS=on -DLLAMA_HIP_UMA=on -DAMDGPU_TARGETS=gfx900 in two places: in the llamafile-0.8/llamafile/cuda.c file, just under "-DGGML_USE_HIPBLAS", inside the static bool compile_amd_unix(const char *dso, const char *src, const char *tmpdso) function, and in the llamafile-0.8/llamafile/rocm.sh file, just under the -DGGML_USE_HIPBLAS \ line (roughly as sketched below).
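To show the kind of edit I mean (a sketch from memory, not an exact copy of either file; the hipcc invocation in the log further down shows which flags actually ended up on the compile line):

```c
/* Hypothetical sketch of the edit in llamafile-0.8/llamafile/cuda.c
 * (compile_amd_unix): the three extra -D flags were inserted right after
 * the existing "-DGGML_USE_HIPBLAS" entry in the hipcc argument list.
 * The array name here is made up for illustration only. */
static const char *hipcc_flags[] = {
    /* ...existing flags... */
    "-DGGML_USE_HIPBLAS",
    "-DLLAMA_HIPBLAS=on",      /* added */
    "-DLLAMA_HIP_UMA=on",      /* added */
    "-DAMDGPU_TARGETS=gfx900", /* added */
    /* ...existing flags continue... */
};
```

The same three flags were added as extra continuation lines in rocm.sh.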
Then I compiled llamafile:
cd llamafile-0.8
make
sudo make install PREFIX=/usr/local
Unfortunately, when I launched the same model that I use with llama.cpp, the ggml_backend_cuda_buffer_type_alloc_buffer: allocating 5169.86 MiB on device 0: cudaMalloc failed: out of memory error occurred:
$ llamafile -m Shared/models/TheBloke/CodeLlama-7B-Instruct-GGUF/codellama-7b-instruct.Q6_K.gguf -ngl 9999 --port 8080 --host 0.0.0.0
import_cuda_impl: initializing gpu module...
extracting /zip/llama.cpp/ggml.h to /home/ubuntu/.llamafile/ggml.h
extracting /zip/llamafile/compcap.cu to /home/ubuntu/.llamafile/compcap.cu
extracting /zip/llamafile/llamafile.h to /home/ubuntu/.llamafile/llamafile.h
extracting /zip/llamafile/tinyblas.h to /home/ubuntu/.llamafile/tinyblas.h
extracting /zip/llamafile/tinyblas.cu to /home/ubuntu/.llamafile/tinyblas.cu
extracting /zip/llama.cpp/ggml-impl.h to /home/ubuntu/.llamafile/ggml-impl.h
extracting /zip/llama.cpp/ggml-cuda.h to /home/ubuntu/.llamafile/ggml-cuda.h
extracting /zip/llama.cpp/ggml-alloc.h to /home/ubuntu/.llamafile/ggml-alloc.h
extracting /zip/llama.cpp/ggml-common.h to /home/ubuntu/.llamafile/ggml-common.h
extracting /zip/llama.cpp/ggml-backend.h to /home/ubuntu/.llamafile/ggml-backend.h
extracting /zip/llama.cpp/ggml-backend-impl.h to /home/ubuntu/.llamafile/ggml-backend-impl.h
extracting /zip/llama.cpp/ggml-cuda.cu to /home/ubuntu/.llamafile/ggml-cuda.cu
get_rocm_bin_path: note: hipInfo not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/hipInfo does not exist
get_rocm_bin_path: note: /opt/rocm/bin/hipInfo does not exist
llamafile_log_command: /usr/bin/rocminfo
llamafile_log_command: hipcc -O3 -fPIC -shared -DNDEBUG --offload-arch=gfx900 -march=native -mtune=native -DGGML_BUILD=1 -DGGML_SHARED=1 -Wno-return-type -Wno-unused-result -DGGML_USE_HIPBLAS -DLLAMA_HIPBLAS=on -DLLAMA_HIP_UMA=on -DAMDGPU_TARGETS=gfx900 -DGGML_CUDA_MMV_Y=1 -DGGML_MULTIPLATFORM -DGGML_CUDA_DMMV_X=32 -DIGNORE4 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DIGNORE -o /home/ubuntu/.llamafile/ggml-rocm.so.na8zyf /home/ubuntu/.llamafile/ggml-cuda.cu -lhipblas -lrocblas
/home/ubuntu/.llamafile/ggml-cuda.cu:405:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
405 | }
| ^
/home/ubuntu/.llamafile/ggml-cuda.cu:9182:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
9182 | }
| ^
/home/ubuntu/.llamafile/ggml-cuda.cu:8597:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
8597 | static __global__ void soft_max_f32(const float * x, const float * mask, const float * pos, float * dst, const int ncols_par, const int nrows_y, const float scale, const float max_bias, const float m0, const float m1, uint32_t n_head_log2) {
| ^
/home/ubuntu/.llamafile/ggml-cuda.cu:8597:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/ubuntu/.llamafile/ggml-cuda.cu:8597:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/ubuntu/.llamafile/ggml-cuda.cu:8597:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/ubuntu/.llamafile/ggml-cuda.cu:8597:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/ubuntu/.llamafile/ggml-cuda.cu:8597:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
8 warnings generated when compiling for gfx900.
/home/ubuntu/.llamafile/ggml-cuda.cu:405:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
405 | }
| ^
/home/ubuntu/.llamafile/ggml-cuda.cu:9182:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
9182 | }
| ^
2 warnings generated when compiling for host.
link_cuda_dso: note: dynamically linking /home/ubuntu/.llamafile/ggml-rocm.so
ggml_cuda_link: welcome to ROCm SDK with hipBLAS
link_cuda_dso: GPU support loaded
{"build":1500,"commit":"a30b324","function":"server_cli","level":"INFO","line":2859,"msg":"build info","tid":"9434528","timestamp":1714076061}
{"function":"server_cli","level":"INFO","line":2862,"msg":"system info","n_threads":6,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"9434528","timestamp":1714076061,"total_threads":12}
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from Shared/models/TheBloke/CodeLlama-7B-Instruct-GGUF/codellama-7b-instruct.Q6_K.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = codellama_codellama-7b-instruct-hf
llama_model_loader: - kv 2: llama.context_length u32 = 16384
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 18
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32016] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32016] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32016] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q6_K: 226 tensors
llm_load_vocab: mismatch in special tokens definition ( 264/32016 vs 259/32016 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32016
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 16384
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 16384
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q6_K
llm_load_print_meta: model params = 6.74 B
llm_load_print_meta: model size = 5.15 GiB (6.56 BPW)
llm_load_print_meta: general.name = codellama_codellama-7b-instruct-hf
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, compute capability 9.0, VMM: no
llm_load_tensors: ggml ctx size = 0.30 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 5169.86 MiB on device 0: cudaMalloc failed: out of memory
llama_model_load: error loading model: unable to allocate backend buffer
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'Shared/models/TheBloke/CodeLlama-7B-Instruct-GGUF/codellama-7b-instruct.Q6_K.gguf'
{"function":"load_model","level":"ERR","line":447,"model":"Shared/models/TheBloke/CodeLlama-7B-Instruct-GGUF/codellama-7b-instruct.Q6_K.gguf","msg":"unable to load model","tid":"9434528","timestamp":1714076061}
When I run llama.cpp compiled with UMA support, everything works fine. Should I add -DLLAMA_HIPBLAS=on -DLLAMA_HIP_UMA=on -DAMDGPU_TARGETS=gfx900 for llama.cpp somewhere else? I would be very grateful for any help.