Long context models don't split memory correctly, leading to OOM error #4212

Open
kungfu-eric opened this issue May 6, 2024 · 0 comments
Labels: bug (Something isn't working), gpu, nvidia (Issues relating to Nvidia GPUs and CUDA)
What is the issue?

With Mixtral at the default 2048-token context, Ollama splits the model across my 2x GPUs at roughly 12 GB each. When I extend the context to 12k, it instead loads everything onto a single GPU, using about 29 GB. Ideally memory would be split evenly as before so I could push to a higher 16k context without OOM. Hardware is 2x 48 GB A6000. Possibly related to #1341.
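For reference, a minimal sketch of how the larger context can be requested. My exact client isn't shown in the logs, so the model name, prompt, and payload below are illustrative assumptions; `/api/chat` and the `num_ctx` option under `options` are standard Ollama API parameters.

```python
# Reproduction sketch (assumption: context is raised via the standard Ollama
# "num_ctx" request option; model name and prompt are placeholders).
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"  # default Ollama endpoint

resp = requests.post(
    OLLAMA_URL,
    json={
        "model": "mixtral",
        "messages": [{"role": "user", "content": "Summarize this long document ..."}],
        "stream": False,
        # With the default num_ctx (2048) the model is split ~12 GB per GPU;
        # raising it to 12k+ makes Ollama place everything on a single GPU.
        "options": {"num_ctx": 12288},
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```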

[GIN] 2024/05/05 - 23:38:16 | 200 |  24.89660366s |      172.17.0.1 | POST     "/api/chat"
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":42754,"tid":"139643039125504","timestamp":1714977496}
{"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":56082,"status":200,"tid":"139637473294080","timestamp":1714977496}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":42755,"tid":"139643039125504","timestamp":1714977496}
{"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":56082,"status":200,"tid":"139637473294080","timestamp":1714977496}
{"function":"log_server_request","level":"INFO","line":2734,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":56082,"status":200,"tid":"139637473294080","timestamp":1714977496}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":42756,"tid":"139643039125504","timestamp":1714977496}
{"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":56082,"status":200,"tid":"139637473294080","timestamp":1714977496}
{"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":42757,"tid":"139643039125504","timestamp":1714977496}
{"function":"update_slots","ga_i":0,"level":"INFO","line":1809,"msg":"slot progression","n_past":15,"n_past_se":0,"n_prompt_tokens_processed":15377,"slot_id":0,"task_id":42757,"tid":"139643039125504","timestamp":1714977496}
{"function":"update_slots","level":"INFO","line":1836,"msg":"kv cache rm [p0, end)","p0":15,"slot_id":0,"task_id":42757,"tid":"139643039125504","timestamp":1714977496}
{"function":"print_timings","level":"INFO","line":269,"msg":"prompt eval time     =   24744.71 ms / 15377 tokens (    1.61 ms per token,   621.43 tokens per second)","n_prompt_tokens_processed":15377,"n_tokens_second":621.4256758605363,"slot_id":0,"t_prompt_processing":24744.713,"t_token":1.6092029004357156,"task_id":42757,"tid":"139643039125504","timestamp":1714977535}
{"function":"print_timings","level":"INFO","line":283,"msg":"generation eval time =   13787.47 ms /   550 runs   (   25.07 ms per token,    39.89 tokens per second)","n_decoded":550,"n_tokens_second":39.891292601180645,"slot_id":0,"t_token":25.06812727272727,"t_token_generation":13787.47,"task_id":42757,"tid":"139643039125504","timestamp":1714977535}
{"function":"print_timings","level":"INFO","line":293,"msg":"          total time =   38532.18 ms","slot_id":0,"t_prompt_processing":24744.713,"t_token_generation":13787.47,"t_total":38532.183,"task_id":42757,"tid":"139643039125504","timestamp":1714977535}
{"function":"update_slots","level":"INFO","line":1640,"msg":"slot released","n_cache_tokens":15942,"n_ctx":16384,"n_past":15941,"n_system_tokens":0,"slot_id":0,"task_id":42757,"tid":"139643039125504","timestamp":1714977535,"truncated":false}
{"function":"log_server_request","level":"INFO","line":2734,"method":"POST","msg":"request","params":{},"path":"/completion","remote_addr":"127.0.0.1","remote_port":56082,"status":200,"tid":"139637473294080","timestamp":1714977535}
[GIN] 2024/05/05 - 23:38:55 | 200 | 38.672720153s |      172.17.0.1 | POST     "/api/chat"
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":43310,"tid":"139643039125504","timestamp":1714977535}
{"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":35540,"status":200,"tid":"139637464901376","timestamp":1714977535}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":43311,"tid":"139643039125504","timestamp":1714977535}
{"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":35540,"status":200,"tid":"139637464901376","timestamp":1714977535}
{"function":"log_server_request","level":"INFO","line":2734,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":35540,"status":200,"tid":"139637464901376","timestamp":1714977535}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":43312,"tid":"139643039125504","timestamp":17149775llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8x7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 46.70 B
llm_load_print_meta: model size       = 24.62 GiB (4.53 BPW)
llm_load_print_meta: general.name     = mistralai
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
  Device 1: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.42 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:  CUDA_Host buffer size = 25215.87 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =  2048.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.14 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1145.00 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 1200621568
llama_new_context_with_model: failed to allocate compute buffers
llama_init_from_gpt_params: error: failed to create context with model '/root/.ollama/models/blobs/sha256-e9e56e8bb5f0fcd4860675e6837a8f6a94e659f5fa7dce6a1076279336320f2b'
{"function":"load_model","level":"ERR","line":410,"model":"/root/.ollama/models/blobs/sha256-e9e56e8bb5f0fcd4860675e6837a8f6a94e659f5fa7dce6a1076279336320f2b","msg":"unable to load model","tid":"140631930466304","timestamp":1714999945}
time=2024-05-06T05:52:25.670-07:00 level=ERROR source=sched.go:333 msg="error loading llama server" error="llama runner process no longer running: 1 error:failed to create context with model '/root/.ollama/models/blobs/sha256-e9e56e8bb5f0fcd4860675e6837a8f6a94e659f5fa7dce6a1076279336320f2b'"
[GIN] 2024/05/06 - 05:52:25 | 500 | 20.037722871s |      172.17.0.1 | POST     "/api/chat"

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.1.33

@kungfu-eric kungfu-eric added the bug Something isn't working label May 6, 2024
@jmorganca jmorganca added gpu nvidia Issues relating to Nvidia GPUs and CUDA labels May 7, 2024