With mixtral at the default 2048 ctx, memory is split across the 2x GPUs at ~12 GB each. When the context is extended to 12k, it instead puts all of the memory on one GPU, using 29 GB. Ideally it would split equally as before, so the context could be pushed higher to 16k without OOM. Hardware is 2x 48 GB A6000. Issue is possibly related to #1341
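For context, the extended-context run was presumably configured with an Ollama Modelfile along these lines (a sketch; the `mixtral-16k` tag is an assumption, `num_ctx` is the standard parameter for context length):

```
FROM mixtral
PARAMETER num_ctx 16384
```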
[GIN] 2024/05/05 - 23:38:16 | 200 | 24.89660366s | 172.17.0.1 | POST "/api/chat"
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":42754,"tid":"139643039125504","timestamp":1714977496}
{"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":56082,"status":200,"tid":"139637473294080","timestamp":1714977496}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":42755,"tid":"139643039125504","timestamp":1714977496}
{"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":56082,"status":200,"tid":"139637473294080","timestamp":1714977496}
{"function":"log_server_request","level":"INFO","line":2734,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":56082,"status":200,"tid":"139637473294080","timestamp":1714977496}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":42756,"tid":"139643039125504","timestamp":1714977496}
{"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":56082,"status":200,"tid":"139637473294080","timestamp":1714977496}
{"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":42757,"tid":"139643039125504","timestamp":1714977496}
{"function":"update_slots","ga_i":0,"level":"INFO","line":1809,"msg":"slot progression","n_past":15,"n_past_se":0,"n_prompt_tokens_processed":15377,"slot_id":0,"task_id":42757,"tid":"139643039125504","timestamp":1714977496}
{"function":"update_slots","level":"INFO","line":1836,"msg":"kv cache rm [p0, end)","p0":15,"slot_id":0,"task_id":42757,"tid":"139643039125504","timestamp":1714977496}
{"function":"print_timings","level":"INFO","line":269,"msg":"prompt eval time = 24744.71 ms / 15377 tokens ( 1.61 ms per token, 621.43 tokens per second)","n_prompt_tokens_processed":15377,"n_tokens_second":621.4256758605363,"slot_id":0,"t_prompt_processing":24744.713,"t_token":1.6092029004357156,"task_id":42757,"tid":"139643039125504","timestamp":1714977535}
{"function":"print_timings","level":"INFO","line":283,"msg":"generation eval time = 13787.47 ms / 550 runs ( 25.07 ms per token, 39.89 tokens per second)","n_decoded":550,"n_tokens_second":39.891292601180645,"slot_id":0,"t_token":25.06812727272727,"t_token_generation":13787.47,"task_id":42757,"tid":"139643039125504","timestamp":1714977535}
{"function":"print_timings","level":"INFO","line":293,"msg":" total time = 38532.18 ms","slot_id":0,"t_prompt_processing":24744.713,"t_token_generation":13787.47,"t_total":38532.183,"task_id":42757,"tid":"139643039125504","timestamp":1714977535}
{"function":"update_slots","level":"INFO","line":1640,"msg":"slot released","n_cache_tokens":15942,"n_ctx":16384,"n_past":15941,"n_system_tokens":0,"slot_id":0,"task_id":42757,"tid":"139643039125504","timestamp":1714977535,"truncated":false}
{"function":"log_server_request","level":"INFO","line":2734,"method":"POST","msg":"request","params":{},"path":"/completion","remote_addr":"127.0.0.1","remote_port":56082,"status":200,"tid":"139637473294080","timestamp":1714977535}
[GIN] 2024/05/05 - 23:38:55 | 200 | 38.672720153s | 172.17.0.1 | POST "/api/chat"
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":43310,"tid":"139643039125504","timestamp":1714977535}
{"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":35540,"status":200,"tid":"139637464901376","timestamp":1714977535}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":43311,"tid":"139643039125504","timestamp":1714977535}
{"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":35540,"status":200,"tid":"139637464901376","timestamp":1714977535}
{"function":"log_server_request","level":"INFO","line":2734,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":35540,"status":200,"tid":"139637464901376","timestamp":1714977535}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":43312,"tid":"139643039125504","timestamp":17149775
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8x7B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 46.70 B
llm_load_print_meta: model size = 24.62 GiB (4.53 BPW)
llm_load_print_meta: general.name = mistralai
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
Device 1: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.42 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 25215.87 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 16384
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 2048.00 MiB
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.14 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1145.00 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 1200621568
llama_new_context_with_model: failed to allocate compute buffers
llama_init_from_gpt_params: error: failed to create context with model '/root/.ollama/models/blobs/sha256-e9e56e8bb5f0fcd4860675e6837a8f6a94e659f5fa7dce6a1076279336320f2b'
{"function":"load_model","level":"ERR","line":410,"model":"/root/.ollama/models/blobs/sha256-e9e56e8bb5f0fcd4860675e6837a8f6a94e659f5fa7dce6a1076279336320f2b","msg":"unable to load model","tid":"140631930466304","timestamp":1714999945}
time=2024-05-06T05:52:25.670-07:00 level=ERROR source=sched.go:333 msg="error loading llama server" error="llama runner process no longer running: 1 error:failed to create context with model '/root/.ollama/models/blobs/sha256-e9e56e8bb5f0fcd4860675e6837a8f6a94e659f5fa7dce6a1076279336320f2b'"
[GIN] 2024/05/06 - 05:52:25 | 500 | 20.037722871s | 172.17.0.1 | POST "/api/chat"
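As a sanity check on the log above, the reported "KV self size = 2048.00 MiB" at n_ctx = 16384 matches the standard f16 KV-cache formula. The Mixtral 8x7B dimensions below (32 layers, 8 KV heads, head dim 128) are assumptions, not taken from the log:

```python
def kv_cache_mib(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Estimate the f16 KV-cache size in MiB for a given context length."""
    # K and V each hold n_layers * n_ctx * n_kv_heads * head_dim elements
    per_tensor = n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem
    return 2 * per_tensor / (1024 ** 2)

print(kv_cache_mib(16384))  # 2048.0, matching the "KV self size" log line
print(kv_cache_mib(2048))   # 256.0 at the default context
```

This shows the KV cache itself is small relative to 48 GB; the OOM comes from the compute buffer (~1145 MiB) landing on a device whose memory was not rebalanced after the context change.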
OS: Linux
GPU: Nvidia
CPU: AMD
Ollama version: 0.1.33