Port conflict when submitting to a Slurm cluster #305

Open
aozaki-touko opened this issue Mar 16, 2024 · 0 comments
Launch command:

#!/bin/sh
#SBATCH -J finetune-clm
#SBATCH -o job-%j.log
#SBATCH -e job-%j.err
#SBATCH --nodes=1
#SBATCH --partition=GPU-A100
#SBATCH --time=01:00:00
#SBATCH --mem=64G
#SBATCH --gres=gpu:a100:1
#SBATCH --qos=qos_a100_gpu
module load anaconda3
module load ~/modulefiles/cuda/12.1
source /opt/anaconda3/etc/profile.d/conda.sh
conda activate llama_env

output_model=../../Llama_model
# change this to your own directory
if [ ! -d "${output_model}" ]; then
    mkdir -p "${output_model}"
fi
cp ./finetune.sh ${output_model}
deepspeed --include localhost:0     --master_port 29555  finetune_clm_lora.py \
    --model_name_or_path ../../../Llama-2-7b-chat-hf \
    --train_files ../../data/train.csv \
    --validation_files  ../../data/val.csv \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --use_fast_tokenizer false \
    --output_dir ${output_model} \
    --evaluation_strategy  steps \
    --max_eval_samples 800 \
    --learning_rate 1e-4 \
    --gradient_accumulation_steps 8 \
    --num_train_epochs 1 \
    --warmup_steps 400 \
    --load_in_bits 4 \
    --lora_r 8 \
    --lora_alpha 32 \
    --target_modules q_proj,k_proj,v_proj,o_proj,down_proj,gate_proj,up_proj \
    --logging_dir ${output_model}/logs \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --preprocessing_num_workers 10 \
    --save_steps 20 \
    --eval_steps 20 \
    --save_total_limit 2000 \
    --seed 42 \
    --disable_tqdm false \
    --ddp_find_unused_parameters false \
    --block_size 2048 \
    --overwrite_output_dir \
    --deepspeed ds_config_zero2.json \
    --ignore_data_skip true \
    --bf16 \
    --gradient_checkpointing \
    --bf16_full_eval \
    --ddp_timeout 18000000 \
    | tee -a ${output_model}/train.log

    # --resume_from_checkpoint ${output_model}/checkpoint-20400 \

I already changed DeepSpeed's master port in the launch command, but the following error still occurs:

[W socket.cpp:436] [c10d] The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use).
[W socket.cpp:436] [c10d] The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).
[E socket.cpp:472] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
  File "/gpfs/home/ljgroup/touko/Llama-Chinese/train/sft/finetune_clm_lora.py", line 692, in <module>
    main()
  File "/gpfs/home/ljgroup/touko/Llama-Chinese/train/sft/finetune_clm_lora.py", line 281, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/home/ljgroup/touko/.conda/envs/llama_env/lib/python3.9/site-packages/transformers/hf_argparser.py", line 338, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 123, in __init__
  File "/home/ljgroup/touko/.conda/envs/llama_env/lib/python3.9/site-packages/transformers/training_args.py", line 1528, in __post_init__
    and (self.device.type != "cuda")
  File "/home/ljgroup/touko/.conda/envs/llama_env/lib/python3.9/site-packages/transformers/training_args.py", line 1995, in device
    return self._setup_devices
  File "/home/ljgroup/touko/.conda/envs/llama_env/lib/python3.9/site-packages/transformers/utils/generic.py", line 56, in __get__
    cached = self.fget(obj)
  File "/home/ljgroup/touko/.conda/envs/llama_env/lib/python3.9/site-packages/transformers/training_args.py", line 1927, in _setup_devices
    self.distributed_state = PartialState(timeout=timedelta(seconds=self.ddp_timeout))
  File "/home/ljgroup/touko/.conda/envs/llama_env/lib/python3.9/site-packages/accelerate/state.py", line 190, in __init__
    dist.init_distributed(dist_backend=self.backend, auto_mpi_discovery=False, **kwargs)
  File "/home/ljgroup/touko/.conda/envs/llama_env/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 670, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
  File "/home/ljgroup/touko/.conda/envs/llama_env/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 121, in __init__
    self.init_process_group(backend, timeout, init_method, rank, world_size)
  File "/home/ljgroup/touko/.conda/envs/llama_env/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 149, in init_process_group
    torch.distributed.init_process_group(backend,
  File "/home/ljgroup/touko/.conda/envs/llama_env/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 74, in wrapper
    func_return = func(*args, **kwargs)
  File "/home/ljgroup/touko/.conda/envs/llama_env/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1141, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/ljgroup/touko/.conda/envs/llama_env/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 241, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/home/ljgroup/touko/.conda/envs/llama_env/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 172, in _create_c10d_store
    return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).

I also searched finetune_clm_lora.py and could not find anywhere to change this port.
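For what it's worth, the traceback shows the bind attempt still targeting 29500, which is the default torch.distributed env-rendezvous port used when the MASTER_PORT environment variable is unset. A common workaround on shared Slurm nodes (a sketch on my part, not something from this repo) is to derive a per-job port from `SLURM_JOB_ID` and export it before launching:

```shell
# Hypothetical workaround: derive a unique rendezvous port from the Slurm job id
# so two jobs landing on the same node don't both fall back to the default 29500.
MASTER_PORT=$((29500 + ${SLURM_JOB_ID:-0} % 1000))
export MASTER_PORT
echo "using MASTER_PORT=$MASTER_PORT"
```

Passing the same value to the launcher (e.g. `--master_port $MASTER_PORT`) keeps the deepspeed launcher and the env rendezvous in agreement.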
