
Segmentation fault on finetune with -ngl > 0, Debian 12 stable #6994

Open
Basiliotornado opened this issue Apr 30, 2024 · 1 comment
Specs: RTX 3060 Ti with 8 GB VRAM, Ryzen 7 5700X, 32 GB RAM

`main` says:

main: build = 2769 (8843a98c)
main: built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu

`make` says:

GNU Make 4.3
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2020 Free Software Foundation, Inc.

Compiled with `make LLAMA_CUDA=1`, using CUDA 11.8.89~11.8.0-5~deb12u1. Ran with `./finetune --model-base ./models/tinyllama1.1b.gguf --train-data ../data.txt -ngl 100` on TinyLlama-1.1B-Chat.
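For easy copy/paste, these are the exact build and run steps (model path as on my machine):

```sh
# Build llama.cpp with CUDA support, then run finetune with all layers offloaded to the GPU
make LLAMA_CUDA=1
./finetune --model-base ./models/tinyllama1.1b.gguf --train-data ../data.txt -ngl 100
```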

Nothing seems to be off in the output when the binary is compiled in debug mode.

Output of finetune in debug:

~/Downloads/Llama/llama.cpp$ ./finetune --model-base ./models/tinyllama1.1b.gguf --train-data ../data.txt -ngl 100
main: seed: 1714445201
main: model base = './models/tinyllama1.1b.gguf'
llama_model_loader: loaded meta data with 22 key-value pairs and 201 tensors from ./models/tinyllama1.1b.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = TinyLlama_TinyLlama-1.1B-Chat-v1.0
llama_model_loader: - kv 2: llama.block_count u32 = 22
llama_model_loader: - kv 3: llama.context_length u32 = 2048
llama_model_loader: - kv 4: llama.embedding_length u32 = 2048
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 5632
llama_model_loader: - kv 6: llama.attention.head_count u32 = 32
llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 4
llama_model_loader: - kv 8: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 1
llama_model_loader: - kv 11: llama.vocab_size u32 = 32000
llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 13: tokenizer.ggml.model str = llama
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,32000] = ["", "", "", "<0x00>", "<...
llama_model_loader: - kv 15: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 21: tokenizer.chat_template str = {% for message in messages %}\n{% if m...
llama_model_loader: - type f32: 45 tensors
llama_model_loader: - type f16: 156 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_layer = 22
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 256
llm_load_print_meta: n_embd_v_gqa = 256
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 5632
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 1.10 B
llm_load_print_meta: model size = 2.05 GiB (16.00 BPW)
llm_load_print_meta: general.name = TinyLlama_TinyLlama-1.1B-Chat-v1.0
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: PAD token = 2 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.20 MiB
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 23/23 layers to GPU
llm_load_tensors: CPU buffer size = 125.00 MiB
llm_load_tensors: CUDA0 buffer size = 1973.35 MiB
..........................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 11.00 MiB
llama_new_context_with_model: KV self size = 11.00 MiB, K (f16): 5.50 MiB, V (f16): 5.50 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.12 MiB
ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 66.50 MiB
ggml_gallocr_reserve_n: reallocating CUDA_Host buffer from size 0.00 MiB to 5.01 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 66.50 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 5.01 MiB
llama_new_context_with_model: graph nodes = 710
llama_new_context_with_model: graph splits = 2
main: init model
print_params: n_vocab : 32000
print_params: n_ctx : 128
print_params: n_embd : 2048
print_params: n_ff : 5632
print_params: n_head : 32
print_params: n_head_kv : 4
print_params: n_layer : 22
print_params: norm_rms_eps : 0.000010
print_params: rope_freq_base : 10000.000000
print_params: rope_freq_scale : 1.000000
print_lora_params: n_rank_attention_norm : 1
print_lora_params: n_rank_wq : 4
print_lora_params: n_rank_wk : 4
print_lora_params: n_rank_wv : 4
print_lora_params: n_rank_wo : 4
print_lora_params: n_rank_ffn_norm : 1
print_lora_params: n_rank_ffn_gate : 4
print_lora_params: n_rank_ffn_down : 4
print_lora_params: n_rank_ffn_up : 4
print_lora_params: n_rank_tok_embeddings : 4
print_lora_params: n_rank_norm : 1
print_lora_params: n_rank_output : 4
main: total train_iterations 0
main: seen train_samples 0
main: seen train_tokens 0
main: completed train_epochs 0
main: lora_size = 28472224 bytes (27.2 MB)
main: opt_size = 42223360 bytes (40.3 MB)
main: opt iter 0
main: input_size = 131076128 bytes (125.0 MB)
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 3825.01 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 4706.01 MiB
main: compute_size = 4010812000 bytes (3825.0 MB)
main: evaluation order = LEFT_TO_RIGHT
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: reallocating buffers automatically
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 3825.01 MiB
main: tokenize training data from ../data.txt
main: sample-start:
main: include-sample-start: false
tokenize_file: total number of samples: 31763
main: number of training tokens: 31891
main: number of unique tokens: 3308
main: train data seems to have changed. restarting shuffled epoch.
main: begin training
main: work_size = 768376 bytes (0.7 MB)
train_opt_callback: iter= 0 sample=1/31763 sched=0.000000 loss=0.000000 |-> Segmentation fault (core dumped)
Watching KDE System Monitor, it seems to crash before any work is actually done on the GPU.
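If it helps, this is roughly how a backtrace could be captured from the debug build (a minimal sketch, assuming gdb is installed; paths are the same as above):

```sh
# Run finetune under gdb so a backtrace is available when the SIGSEGV hits
gdb --args ./finetune --model-base ./models/tinyllama1.1b.gguf --train-data ../data.txt -ngl 100
# Inside gdb:
#   (gdb) run
#   (gdb) bt     # print the backtrace after the crash
```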

Basiliotornado commented Apr 30, 2024

Occasionally I'll get a segfault in main as well, albeit through text-generation-webui, so likely on an older version of llama.cpp. I doubt it's the same issue, but thought I'd share.

GGML_ASSERT: /home/runner/work/llama-cpp-python-cuBLAS-wheels/llama-cpp-python-cuBLAS-wheels/vendor/llama.cpp/ggml.c:5513: a->ne[2] == b->ne[0]
Segmentation fault (core dumped)
Unable to attach: program terminated with signal SIGSEGV, Segmentation fault.
No stack.
The program is not being run.
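For what it's worth, the "No stack" output above looks like gdb was started without loading the core file. Something like the following sketch should make the dump inspectable (assuming core dumps are enabled; use `coredumpctl` only if systemd-coredump is installed, otherwise point gdb at the crashing binary and core file directly):

```sh
# Allow core files, reproduce the crash, then open the most recent dump
ulimit -c unlimited
coredumpctl gdb          # or: gdb <crashing-binary> /path/to/core
# Inside gdb:
#   (gdb) bt             # backtrace at the point of the segfault
```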
