Medusa Training Loss #95

Open
TomYang-TZ opened this issue Apr 7, 2024 · 5 comments
Comments

@TomYang-TZ

When training with Axolotl, the training loss drops to 0 after the first gradient accumulation steps. Is this expected behaviour?
[screenshot: Axolotl training loss log]

With Torchrun, the training loss consistently remains NaN.
[screenshot: Torchrun training loss log showing NaN]

Thanks for the help!! Here is the training configuration:
base_model: teknium/OpenHermes-2.5-Mistral-7B
base_model_config: teknium/OpenHermes-2.5-Mistral-7B
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
is_llama_derived_model: false

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: ShareGPT_Vicuna_unfiltered/ShareGPT_V4.3_unfiltered_cleaned_split.json
    type: sharegpt

dataset_prepared_path:
val_set_size: 0.1
output_dir: ./openhermes7B_medusa_stage1

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

wandb_project:
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 2
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0005

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
use_reentrant: True

warmup_steps: 40
eval_steps: 0.01
evaluation_strategy: steps
save_strategy: steps
save_steps:
save_total_limit: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: ""
  eos_token: "<|im_end|>"
  unk_token: ""

medusa_num_heads: 5
medusa_num_layers: 1
medusa_heads_coefficient: 0.2
medusa_decay_coefficient: 0.8
medusa_logging: true
medusa_scheduler: constant
medusa_lr_multiplier: 4.0
medusa_only_heads: true
ddp_find_unused_parameters: true
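
For reference, here is a rough sketch of how medusa_heads_coefficient and medusa_decay_coefficient typically weight the per-head losses. This is not the exact Axolotl/Medusa training code; the shift offsets and the decay exponent are assumptions based on the Medusa setup, and only the parameter names come from the config above. The comment about fully masked labels is relevant because that case yields exactly the NaN symptom described above.

# Hedged sketch of Medusa per-head loss weighting; shapes, shift offsets and
# the decay exponent are illustrative assumptions, not the Axolotl/Medusa code.
import torch
import torch.nn.functional as F

def medusa_head_loss(head_logits, labels, heads_coefficient=0.2, decay_coefficient=0.8):
    # head_logits: list of (batch, seq, vocab) tensors, one per Medusa head
    # labels: (batch, seq) LongTensor with -100 at masked (prompt/padding) positions
    total = 0.0
    for k, logits in enumerate(head_logits, start=1):
        # Head k predicts the token k+1 positions ahead, so shift before the CE.
        shift_logits = logits[:, : -(k + 1), :]
        shift_labels = labels[:, k + 1 :]
        # If every shifted label is -100 (a fully masked sample or batch),
        # cross_entropy with the default mean reduction returns NaN.
        loss_k = F.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_labels.reshape(-1),
            ignore_index=-100,
        )
        # Each head's loss is scaled by the head coefficient and a decayed weight.
        total = total + loss_k * heads_coefficient * decay_coefficient ** k
    return total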

@vivekmadan2

I am also facing the same issue with the Mistral example listed in the repo.

@FatPigeorz

Same issue here.

@xiaoruirui356

Have you solved this problem?

@TomYang-TZ
Author

Unfortunately, no.

@xiaoruirui356

I found some problems with the data; you might want to check it.
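
For example, a quick check along these lines can surface samples with no trainable targets. The "conversations"/"from"/"value" field names are the common ShareGPT layout and may need adjusting for your copy of the file; this is a sketch, not an official Axolotl/Medusa tool.

# Hedged sanity check for the ShareGPT JSON referenced in the config above.
import json

path = "ShareGPT_Vicuna_unfiltered/ShareGPT_V4.3_unfiltered_cleaned_split.json"
with open(path) as f:
    data = json.load(f)

suspect = []
for i, sample in enumerate(data):
    convs = sample.get("conversations") or []
    has_assistant_text = any(
        turn.get("from") in ("gpt", "assistant") and str(turn.get("value", "")).strip()
        for turn in convs
    )
    # With train_on_inputs: false, samples without any non-empty assistant turn
    # end up with fully masked labels, which can drive the logged loss to 0 or NaN.
    if not has_assistant_text:
        suspect.append(i)

print(f"{len(suspect)} of {len(data)} samples have no trainable assistant text")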
