Training hangs after setting pack_to_max_length = False #525

Open
dongjiancheng77 opened this issue Mar 28, 2024 · 1 comment
Comments
dongjiancheng77 commented Mar 28, 2024

Training runs normally when pack_to_max_length = True; the output is as follows (the warnings below also appear when it is True):

03/28 19:32:38 - mmengine - WARNING - Use random port: 21418
[2024-03-28 19:32:53,988] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-28 19:33:14,422] torch.distributed.run: [WARNING] 
[2024-03-28 19:33:14,422] torch.distributed.run: [WARNING] *****************************************
[2024-03-28 19:33:14,422] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-03-28 19:33:14,422] torch.distributed.run: [WARNING] *****************************************
[2024-03-28 19:33:23,309] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-28 19:33:23,313] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/nfs03/anaconda3/envs/xtu/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  warnings.warn(
/home/nfs03/anaconda3/envs/xtu/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  warnings.warn(
03/28 19:33:45 - mmengine - INFO - 
------------------------------------------------------------
System environment:
    sys.platform: linux
    Python: 3.10.14 (main, Mar 21 2024, 16:24:04) [GCC 11.2.0]
    CUDA available: True
    MUSA available: False
    numpy_random_seed: 1187594430
    GPU 0,1: NVIDIA GeForce RTX 3090
    CUDA_HOME: /home/nfs03/cuda_tools/cuda-11.8
    NVCC: Cuda compilation tools, release 11.8, V11.8.89
    GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
    PyTorch: 2.1.2+cu121
    PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.7  (built against CUDA 11.8)
    - Built with CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

    TorchVision: 0.16.2+cu121
    OpenCV: 4.9.0
    MMEngine: 0.10.3

Runtime environment:
    cudnn_benchmark: False
    mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
    dist_cfg: {'backend': 'nccl'}
    seed: 1187594430
    deterministic: False
    Distributed launcher: pytorch
    Distributed training: True
    GPU number: 2
------------------------------------------------------------

03/28 19:33:45 - mmengine - INFO - Config:
SYSTEM = 'xtuner.utils.SYSTEM_TEMPLATE.alpaca'
accumulative_counts = 16
alpaca_en = dict(
    dataset=dict(
        data_files=dict(
            train=
            '/home/nfs03/dongjc/grade-school-math/grade_school_math/data/train_converted_alpaca_v2.json'
        ),
        path='json',
        type='datasets.load_dataset'),
    dataset_map_fn='<function custom_map_fn at 0x7f72ca3e49d0>',
    max_length=2048,
    pack_to_max_length=False,
    remove_unused_columns=True,
    shuffle_before_pack=False,
    template_map_fn=dict(
        template='xtuner.utils.PROMPT_TEMPLATE.llama2_chat',
        type='xtuner.dataset.map_fns.template_map_fn_factory'),
    tokenizer=dict(
        padding_side='right',
        pretrained_model_name_or_path='/home/nfs03/dongjc/Llama-2-7b-hf',
        trust_remote_code=True,
        type='transformers.AutoTokenizer.from_pretrained'),
    type='xtuner.dataset.process_hf_dataset',
    use_varlen_attn=False)
alpaca_en_path = '/home/nfs03/dongjc/grade-school-math/grade_school_math/data/train_converted_alpaca_v2.json'
batch_size = 1
betas = (
    0.9,
    0.999,
)
custom_hooks = [
    dict(
        tokenizer=dict(
            padding_side='right',
            pretrained_model_name_or_path='/home/nfs03/dongjc/Llama-2-7b-hf',
            trust_remote_code=True,
            type='transformers.AutoTokenizer.from_pretrained'),
        type='xtuner.engine.hooks.DatasetInfoHook'),
    dict(
        evaluation_inputs=[
            '请给我介绍五个上海的景点',
            'Please tell me five scenic spots in Shanghai',
        ],
        every_n_iters=500,
        prompt_template='xtuner.utils.PROMPT_TEMPLATE.llama2_chat',
        system='xtuner.utils.SYSTEM_TEMPLATE.alpaca',
        tokenizer=dict(
            padding_side='right',
            pretrained_model_name_or_path='/home/nfs03/dongjc/Llama-2-7b-hf',
            trust_remote_code=True,
            type='transformers.AutoTokenizer.from_pretrained'),
        type='xtuner.engine.hooks.EvaluateChatHook'),
]
custom_map_fn = '<function custom_map_fn at 0x7f72ca3e49d0>'
dataloader_num_workers = 0
default_hooks = dict(
    checkpoint=dict(
        by_epoch=False,
        interval=500,
        max_keep_ckpts=2,
        type='mmengine.hooks.CheckpointHook'),
    logger=dict(
        interval=10,
        log_metric_by_epoch=False,
        type='mmengine.hooks.LoggerHook'),
    param_scheduler=dict(type='mmengine.hooks.ParamSchedulerHook'),
    sampler_seed=dict(type='mmengine.hooks.DistSamplerSeedHook'),
    timer=dict(type='mmengine.hooks.IterTimerHook'))
env_cfg = dict(
    cudnn_benchmark=False,
    dist_cfg=dict(backend='nccl'),
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0))
evaluation_freq = 500
evaluation_inputs = [
    '请给我介绍五个上海的景点',
    'Please tell me five scenic spots in Shanghai',
]
launcher = 'pytorch'
load_from = None
log_level = 'INFO'
log_processor = dict(by_epoch=False)
lr = 0.0002
max_epochs = 3
max_length = 2048
max_norm = 1
model = dict(
    llm=dict(
        pretrained_model_name_or_path='/home/nfs03/dongjc/Llama-2-7b-hf',
        quantization_config=dict(
            bnb_4bit_compute_dtype='torch.float16',
            bnb_4bit_quant_type='nf4',
            bnb_4bit_use_double_quant=True,
            llm_int8_has_fp16_weight=False,
            llm_int8_threshold=6.0,
            load_in_4bit=True,
            load_in_8bit=False,
            type='transformers.BitsAndBytesConfig'),
        torch_dtype='torch.float16',
        trust_remote_code=True,
        type='transformers.AutoModelForCausalLM.from_pretrained'),
    lora=dict(
        bias='none',
        lora_alpha=16,
        lora_dropout=0.1,
        r=64,
        task_type='CAUSAL_LM',
        type='peft.LoraConfig'),
    type='xtuner.model.SupervisedFinetune',
    use_varlen_attn=False)
optim_type = 'torch.optim.AdamW'
optim_wrapper = dict(
    accumulative_counts=16,
    clip_grad=dict(error_if_nonfinite=False, max_norm=1),
    dtype='float16',
    loss_scale='dynamic',
    optimizer=dict(
        betas=(
            0.9,
            0.999,
        ),
        lr=0.0002,
        type='torch.optim.AdamW',
        weight_decay=0),
    type='mmengine.optim.AmpOptimWrapper')
pack_to_max_length = False
param_scheduler = [
    dict(
        begin=0,
        by_epoch=True,
        convert_to_iter_based=True,
        end=0.09,
        start_factor=1e-05,
        type='mmengine.optim.LinearLR'),
    dict(
        begin=0.09,
        by_epoch=True,
        convert_to_iter_based=True,
        end=3,
        eta_min=0.0,
        type='mmengine.optim.CosineAnnealingLR'),
]
pretrained_model_name_or_path = '/home/nfs03/dongjc/Llama-2-7b-hf'
prompt_template = 'xtuner.utils.PROMPT_TEMPLATE.llama2_chat'
randomness = dict(deterministic=False, seed=None)
resume = False
save_steps = 500
save_total_limit = 2
tokenizer = dict(
    padding_side='right',
    pretrained_model_name_or_path='/home/nfs03/dongjc/Llama-2-7b-hf',
    trust_remote_code=True,
    type='transformers.AutoTokenizer.from_pretrained')
train_cfg = dict(max_epochs=3, type='xtuner.engine.runner.TrainLoop')
train_dataloader = dict(
    batch_size=1,
    collate_fn=dict(
        type='xtuner.dataset.collate_fns.default_collate_fn',
        use_varlen_attn=False),
    dataset=dict(
        dataset=dict(
            data_files=dict(
                train=
                '/home/nfs03/dongjc/grade-school-math/grade_school_math/data/train_converted_alpaca_v2.json'
            ),
            path='json',
            type='datasets.load_dataset'),
        dataset_map_fn='<function custom_map_fn at 0x7f72ca3e49d0>',
        max_length=2048,
        pack_to_max_length=False,
        remove_unused_columns=True,
        shuffle_before_pack=False,
        template_map_fn=dict(
            template='xtuner.utils.PROMPT_TEMPLATE.llama2_chat',
            type='xtuner.dataset.map_fns.template_map_fn_factory'),
        tokenizer=dict(
            padding_side='right',
            pretrained_model_name_or_path='/home/nfs03/dongjc/Llama-2-7b-hf',
            trust_remote_code=True,
            type='transformers.AutoTokenizer.from_pretrained'),
        type='xtuner.dataset.process_hf_dataset',
        use_varlen_attn=False),
    num_workers=0,
    sampler=dict(shuffle=False, type='mmengine.dataset.DefaultSampler'))
use_varlen_attn = False
visualizer = None
warmup_ratio = 0.03
weight_decay = 0
work_dir = './work_dirs/llama2_7b_qlora_alpaca_e3_copy4'

quantization_config convert to <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>
quantization_config convert to <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>
03/28 19:33:46 - mmengine - WARNING - Failed to search registry with scope "mmengine" in the "builder" registry tree. As a workaround, the current "builder" registry in "xtuner" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmengine" is a correct scope, or whether the registry is initialized.

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]/home/nfs03/anaconda3/envs/xtu/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
/home/nfs03/anaconda3/envs/xtu/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()

Loading checkpoint shards:  50%|█████     | 1/2 [05:21<05:21, 321.34s/it]
Loading checkpoint shards:  50%|█████     | 1/2 [05:21<05:21, 321.66s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [06:09<00:00, 160.57s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [06:09<00:00, 184.69s/it]

Loading checkpoint shards: 100%|██████████| 2/2 [06:09<00:00, 160.63s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [06:09<00:00, 184.78s/it]
03/28 19:39:57 - mmengine - WARNING - Due to the implementation of the PyTorch version of flash attention, even when the `output_attentions` flag is set to True, it is not possible to return the `attn_weights`.
03/28 19:39:57 - mmengine - INFO - dispatch llama attn forward
[... the line above repeats 32 times in the original log, once per decoder layer ...]
03/28 19:40:00 - mmengine - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH   ) RuntimeInfoHook                    
(BELOW_NORMAL) LoggerHook                         
 -------------------- 
before_train:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
(NORMAL      ) DatasetInfoHook                    
(LOW         ) EvaluateChatHook                   
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
before_train_epoch:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
(NORMAL      ) DistSamplerSeedHook                
 -------------------- 
before_train_iter:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
 -------------------- 
after_train_iter:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
(BELOW_NORMAL) LoggerHook                         
(LOW         ) ParamSchedulerHook                 
(LOW         ) EvaluateChatHook                   
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
after_train_epoch:
(NORMAL      ) IterTimerHook                      
(LOW         ) ParamSchedulerHook                 
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
before_val:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) DatasetInfoHook                    
 -------------------- 
before_val_epoch:
(NORMAL      ) IterTimerHook                      
 -------------------- 
before_val_iter:
(NORMAL      ) IterTimerHook                      
 -------------------- 
after_val_iter:
(NORMAL      ) IterTimerHook                      
(BELOW_NORMAL) LoggerHook                         
 -------------------- 
after_val_epoch:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
(BELOW_NORMAL) LoggerHook                         
(LOW         ) ParamSchedulerHook                 
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
after_val:
(VERY_HIGH   ) RuntimeInfoHook                    
(LOW         ) EvaluateChatHook                   
 -------------------- 
after_train:
(VERY_HIGH   ) RuntimeInfoHook                    
(LOW         ) EvaluateChatHook                   
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
before_test:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) DatasetInfoHook                    
 -------------------- 
before_test_epoch:
(NORMAL      ) IterTimerHook                      
 -------------------- 
before_test_iter:
(NORMAL      ) IterTimerHook                      
 -------------------- 
after_test_iter:
(NORMAL      ) IterTimerHook                      
(BELOW_NORMAL) LoggerHook                         
 -------------------- 
after_test_epoch:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
(BELOW_NORMAL) LoggerHook                         
 -------------------- 
after_test:
(VERY_HIGH   ) RuntimeInfoHook                    
 -------------------- 
after_run:
(BELOW_NORMAL) LoggerHook                         
 -------------------- 
03/28 19:40:00 - mmengine - INFO - xtuner_dataset_timeout = 0:30:00

Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 7473 examples [00:00, 54466.65 examples/s]
Generating train split: 7473 examples [00:00, 53437.59 examples/s]

Map (num_proc=32):   0%|          | 0/7473 [00:00<?, ? examples/s]
Map (num_proc=32):   3%|▎         | 234/7473 [00:00<00:03, 1866.83 examples/s]
Map (num_proc=32):  69%|██████▉   | 5143/7473 [00:00<00:00, 26981.33 examples/s]
Map (num_proc=32): 100%|██████████| 7473/7473 [00:00<00:00, 16970.98 examples/s]

Map (num_proc=32):   0%|          | 0/7473 [00:00<?, ? examples/s]
Map (num_proc=32):  22%|██▏       | 1638/7473 [00:00<00:00, 13734.41 examples/s]
Map (num_proc=32):  97%|█████████▋| 7240/7473 [00:00<00:00, 36191.57 examples/s]
Map (num_proc=32): 100%|██████████| 7473/7473 [00:00<00:00, 18992.64 examples/s]

Filter (num_proc=32):   0%|          | 0/7473 [00:00<?, ? examples/s]
Filter (num_proc=32):   3%|▎         | 234/7473 [00:00<00:03, 1939.20 examples/s]
Filter (num_proc=32):  75%|███████▌  | 5609/7473 [00:00<00:00, 29778.24 examples/s]
Filter (num_proc=32): 100%|██████████| 7473/7473 [00:00<00:00, 19898.48 examples/s]

Map (num_proc=32):   0%|          | 0/7473 [00:00<?, ? examples/s]
Map (num_proc=32):   0%|          | 34/7473 [00:00<00:59, 125.53 examples/s]
Map (num_proc=32):   4%|▍         | 297/7473 [00:00<00:08, 860.94 examples/s]
Map (num_proc=32):   9%|▉         | 692/7473 [00:00<00:03, 1776.56 examples/s]
Map (num_proc=32):  15%|█▌        | 1129/7473 [00:00<00:02, 2248.65 examples/s]
Map (num_proc=32):  22%|██▏       | 1671/7473 [00:00<00:01, 2996.98 examples/s]
Map (num_proc=32):  33%|███▎      | 2450/7473 [00:00<00:01, 3953.86 examples/s]
Map (num_proc=32):  42%|████▏     | 3162/7473 [00:01<00:00, 4785.75 examples/s]
Map (num_proc=32):  50%|████▉     | 3702/7473 [00:01<00:00, 4618.43 examples/s]
Map (num_proc=32):  58%|█████▊    | 4323/7473 [00:01<00:00, 4744.55 examples/s]
Map (num_proc=32):  66%|██████▌   | 4948/7473 [00:01<00:00, 4433.81 examples/s]
Map (num_proc=32):  73%|███████▎  | 5426/7473 [00:01<00:00, 4099.89 examples/s]
Map (num_proc=32):  87%|████████▋ | 6490/7473 [00:01<00:00, 5636.64 examples/s]
Map (num_proc=32):  95%|█████████▌| 7128/7473 [00:01<00:00, 4864.11 examples/s]
Map (num_proc=32): 100%|██████████| 7473/7473 [00:02<00:00, 3517.54 examples/s]

Filter (num_proc=32):   0%|          | 0/7473 [00:00<?, ? examples/s]
Filter (num_proc=32):   3%|▎         | 234/7473 [00:00<00:04, 1511.87 examples/s]
Filter (num_proc=32):  66%|██████▌   | 4905/7473 [00:00<00:00, 22945.72 examples/s]
Filter (num_proc=32): 100%|██████████| 7473/7473 [00:00<00:00, 17139.14 examples/s]

Map (num_proc=32):   0%|          | 0/7473 [00:00<?, ? examples/s]
Map (num_proc=32):   3%|▎         | 230/7473 [00:00<00:04, 1566.88 examples/s]
Map (num_proc=32):  65%|██████▍   | 4821/7473 [00:00<00:00, 23358.70 examples/s]
Map (num_proc=32): 100%|██████████| 7473/7473 [00:00<00:00, 17001.77 examples/s]
03/28 19:40:22 - mmengine - WARNING - Dataset Dataset has no metainfo. ``dataset_meta`` in visualizer will be None.
03/28 19:40:22 - mmengine - INFO - Num train samples 7473
03/28 19:40:22 - mmengine - INFO - train example:
03/28 19:40:22 - mmengine - INFO - <s> [INST] <<SYS>>
 You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
Below is an instruction that describes a task. Write a response that appropriately completes the request.

<</SYS>>
 [/INST] [INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?
 [/INST] Natalia sold 48/2 = <<48/2=24>>24 clips in May.
Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.
#### 72</s> 

03/28 19:40:22 - mmengine - INFO - before_train in EvaluateChatHook.
The input hidden states seems to be silently casted in float32, this might be related to the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in torch.bfloat16.
The input hidden states seems to be silently casted in float32, this might be related to the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in torch.bfloat16.
03/28 19:40:31 - mmengine - INFO - Sample output:
<s> [INST] <<SYS>>
 You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
Below is an instruction that describes a task. Write a response that appropriately completes the request.

<</SYS>>
 [/INST] [INST] 请给我介绍五个上海的景点 [/INST]

[INST] 请给我介绍五个上海的景点 [/INST]

[INST] 请给我介绍五个上海的景点 [/INST]



03/28 19:40:36 - mmengine - INFO - Sample output:
<s> [INST] <<SYS>>
 You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
Below is an instruction that describes a task. Write a response that appropriately completes the request.

<</SYS>>
 [/INST] [INST] Please tell me five scenic spots in Shanghai [/INST]

[INST] <<SYS>> You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist

03/28 19:40:36 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
03/28 19:40:36 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
03/28 19:40:36 - mmengine - INFO - Checkpoints will be saved to /home/nfs03/dongjc/4/work_dirs/llama2_7b_qlora_alpaca_e3_copy4.
/home/nfs03/anaconda3/envs/xtu/lib/python3.10/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/home/nfs03/anaconda3/envs/xtu/lib/python3.10/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/home/nfs03/anaconda3/envs/xtu/lib/python3.10/site-packages/mmengine/optim/scheduler/param_scheduler.py:198: UserWarning: Detected call of `scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the parameter value schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn(
/home/nfs03/anaconda3/envs/xtu/lib/python3.10/site-packages/mmengine/optim/scheduler/param_scheduler.py:198: UserWarning: Detected call of `scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the parameter value schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn(
LZHgrla (Collaborator) commented Apr 1, 2024

@dongjiancheng77
Generally speaking, setting pack_to_max_length to False should not introduce any problems. (The warnings can simply be ignored.)

So please first check whether any other part of the config was modified and whether anything else is running on the machine. If the problem persists, please post the config and log from the pack_to_max_length = False run, together with the exact command you executed and a detailed description of the observed behavior.
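
To act on the first suggestion, the two config dumps can be compared programmatically rather than by eye. Below is a minimal sketch (the filenames are hypothetical; substitute the configs from your pack_to_max_length = True and False runs) using mmengine's Config loader, which xtuner itself builds on:

```python
# Diff two mmengine/xtuner configs to spot unintended changes between runs.
from mmengine.config import Config

cfg_ok = Config.fromfile('llama2_7b_qlora_alpaca_e3_pack_true.py')  # run that trains
cfg_hang = Config.fromfile('llama2_7b_qlora_alpaca_e3_copy4.py')    # run that hangs

# Compare top-level keys; anything that differs besides pack_to_max_length
# (and the dataset fields derived from it) is worth investigating.
for key in sorted(set(cfg_ok.keys()) | set(cfg_hang.keys())):
    if cfg_ok.get(key) != cfg_hang.get(key):
        print(f'{key}: {cfg_ok.get(key)!r} -> {cfg_hang.get(key)!r}')
```

If the configs match and the machine is otherwise idle but the run still hangs, attaching `py-spy dump --pid <PID>` to each training process shows where it is blocked; in distributed training this is often a rank waiting inside an NCCL collective.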
