Long context models don't split memory correctly, leading to OOM error #4212

Open
kungfu-eric opened this issue May 6, 2024 · 0 comments
Labels: bug (Something isn't working), gpu, nvidia (Issues relating to Nvidia GPUs and CUDA)
What is the issue?

With Mixtral at the default 2048-token context, Ollama splits the model across my 2x GPUs at roughly 12 GB each. When I extend the context to 12k, it instead loads everything onto a single GPU, using about 29 GB. Ideally memory would be split evenly as before so I could push to a higher 16k context without OOM. Hardware is 2x 48 GB A6000. Possibly related to #1341.
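For reference, a minimal sketch of how the larger context can be requested. My exact client isn't shown in the logs, so the model name, prompt, and payload below are illustrative assumptions; `/api/chat` and the `num_ctx` option under `options` are standard Ollama API parameters.

```python
# Reproduction sketch (assumption: context is raised via the standard Ollama
# "num_ctx" request option; model name and prompt are placeholders).
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"  # default Ollama endpoint

resp = requests.post(
    OLLAMA_URL,
    json={
        "model": "mixtral",
        "messages": [{"role": "user", "content": "Summarize this long document ..."}],
        "stream": False,
        # With the default num_ctx (2048) the model is split ~12 GB per GPU;
        # raising it to 12k+ makes Ollama place everything on a single GPU.
        "options": {"num_ctx": 12288},
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```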

[GIN] 2024/05/05 - 23:38:16 | 200 |  24.89660366s |      172.17.0.1 | POST     "/api/chat"
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":42754,"tid":"139643039125504","timestamp":1714977496}
{"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":56082,"status":200,"tid":"139637473294080","timestamp":1714977496}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":42755,"tid":"139643039125504","timestamp":1714977496}
{"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":56082,"status":200,"tid":"139637473294080","timestamp":1714977496}
{"function":"log_server_request","level":"INFO","line":2734,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":56082,"status":200,"tid":"139637473294080","timestamp":1714977496}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":42756,"tid":"139643039125504","timestamp":1714977496}
{"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":56082,"status":200,"tid":"139637473294080","timestamp":1714977496}
{"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":42757,"tid":"139643039125504","timestamp":1714977496}
{"function":"update_slots","ga_i":0,"level":"INFO","line":1809,"msg":"slot progression","n_past":15,"n_past_se":0,"n_prompt_tokens_processed":15377,"slot_id":0,"task_id":42757,"tid":"139643039125504","timestamp":1714977496}
{"function":"update_slots","level":"INFO","line":1836,"msg":"kv cache rm [p0, end)","p0":15,"slot_id":0,"task_id":42757,"tid":"139643039125504","timestamp":1714977496}
{"function":"print_timings","level":"INFO","line":269,"msg":"prompt eval time     =   24744.71 ms / 15377 tokens (    1.61 ms per token,   621.43 tokens per second)","n_prompt_tokens_processed":15377,"n_tokens_second":621.4256758605363,"slot_id":0,"t_prompt_processing":24744.713,"t_token":1.6092029004357156,"task_id":42757,"tid":"139643039125504","timestamp":1714977535}
{"function":"print_timings","level":"INFO","line":283,"msg":"generation eval time =   13787.47 ms /   550 runs   (   25.07 ms per token,    39.89 tokens per second)","n_decoded":550,"n_tokens_second":39.891292601180645,"slot_id":0,"t_token":25.06812727272727,"t_token_generation":13787.47,"task_id":42757,"tid":"139643039125504","timestamp":1714977535}
{"function":"print_timings","level":"INFO","line":293,"msg":"          total time =   38532.18 ms","slot_id":0,"t_prompt_processing":24744.713,"t_token_generation":13787.47,"t_total":38532.183,"task_id":42757,"tid":"139643039125504","timestamp":1714977535}
{"function":"update_slots","level":"INFO","line":1640,"msg":"slot released","n_cache_tokens":15942,"n_ctx":16384,"n_past":15941,"n_system_tokens":0,"slot_id":0,"task_id":42757,"tid":"139643039125504","timestamp":1714977535,"truncated":false}
{"function":"log_server_request","level":"INFO","line":2734,"method":"POST","msg":"request","params":{},"path":"/completion","remote_addr":"127.0.0.1","remote_port":56082,"status":200,"tid":"139637473294080","timestamp":1714977535}
[GIN] 2024/05/05 - 23:38:55 | 200 | 38.672720153s |      172.17.0.1 | POST     "/api/chat"
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":43310,"tid":"139643039125504","timestamp":1714977535}
{"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":35540,"status":200,"tid":"139637464901376","timestamp":1714977535}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":43311,"tid":"139643039125504","timestamp":1714977535}
{"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":35540,"status":200,"tid":"139637464901376","timestamp":1714977535}
{"function":"log_server_request","level":"INFO","line":2734,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":35540,"status":200,"tid":"139637464901376","timestamp":1714977535}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":43312,"tid":"139643039125504","timestamp":17149775llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8x7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 46.70 B
llm_load_print_meta: model size       = 24.62 GiB (4.53 BPW)
llm_load_print_meta: general.name     = mistralai
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
  Device 1: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.42 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:  CUDA_Host buffer size = 25215.87 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =  2048.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.14 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1145.00 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 1200621568
llama_new_context_with_model: failed to allocate compute buffers
llama_init_from_gpt_params: error: failed to create context with model '/root/.ollama/models/blobs/sha256-e9e56e8bb5f0fcd4860675e6837a8f6a94e659f5fa7dce6a1076279336320f2b'
{"function":"load_model","level":"ERR","line":410,"model":"/root/.ollama/models/blobs/sha256-e9e56e8bb5f0fcd4860675e6837a8f6a94e659f5fa7dce6a1076279336320f2b","msg":"unable to load model","tid":"140631930466304","timestamp":1714999945}
time=2024-05-06T05:52:25.670-07:00 level=ERROR source=sched.go:333 msg="error loading llama server" error="llama runner process no longer running: 1 error:failed to create context with model '/root/.ollama/models/blobs/sha256-e9e56e8bb5f0fcd4860675e6837a8f6a94e659f5fa7dce6a1076279336320f2b'"
[GIN] 2024/05/06 - 05:52:25 | 500 | 20.037722871s |      172.17.0.1 | POST     "/api/chat"

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.1.33

@kungfu-eric kungfu-eric added the bug Something isn't working label May 6, 2024
@jmorganca jmorganca added gpu nvidia Issues relating to Nvidia GPUs and CUDA labels May 7, 2024