Unable to run video captioning code #34

Open
Davidyao99 opened this issue Jun 2, 2022 · 3 comments
Comments

@Davidyao99

I followed the steps to download all the necessary dependencies and data to run the code. When I run the code, this error is thrown:

  in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['<path to python executable>', '-u', 'main_task_caption.py', '--local_rank=3', '--do_train', '--num_thread_reader=4', '--epochs=5', '--batch_size=128', '--n_display=100', '--train_csv', 'data/msrvtt/MSRVTT_train.9k.csv', '--val_csv', 'data/msrvtt/MSRVTT_JSFUSION_test.csv', '--data_path', 'data/msrvtt/MSRVTT_data.json', '--features_path', 'data/msrvtt/msrvtt_videos_features.pickle', '--output_dir', 'ckpts/ckpt_msrvtt_caption', '--bert_model', 'bert-base-uncased', '--do_lower_case', '--lr', '3e-5', '--max_words', '48', '--max_frames', '48', '--batch_size_val', '32', '--visual_num_hidden_layers', '6', '--decoder_num_hidden_layers', '3', '--datatype', 'msrvtt', '--stage_two', '--init_model', 'weight/univl.pretrained.bin']' returned non-zero exit status 1.

There is only 1 GPU on my laptop, so I am not sure whether this is causing the issue. I just wanted to try out the video captioning capability of this model. Thank you!

@ArrowLuo
Contributor

ArrowLuo commented Jun 3, 2022

Hi @Davidyao99, I guess you should use python -m torch.distributed.launch --nproc_per_node=1 for 1 GPU instead of python -m torch.distributed.launch --nproc_per_node=4. If it still fails after that, posting more of the log here will help to solve the problem. Good luck~
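
A minimal sketch of the suggestion above: the value of --nproc_per_node should match the number of GPUs on the machine. The helper name build_launch_command is hypothetical and not part of UniVL, and the script arguments in the usage line are only a small subset of those in the failing command.

```python
import sys


def build_launch_command(num_gpus, script="main_task_caption.py", script_args=()):
    """Build the argv list for torch.distributed.launch, one process per GPU.

    Hypothetical helper for illustration; it only assembles the command
    and does not start any process.
    """
    if num_gpus < 1:
        raise ValueError("need at least one GPU to launch a worker process")
    return [
        sys.executable, "-m", "torch.distributed.launch",
        f"--nproc_per_node={num_gpus}",  # must not exceed the GPUs present
        script,
        *script_args,
    ]


# Single-GPU laptop: launch exactly one worker.
command = build_launch_command(1, script_args=["--do_train", "--datatype", "msrvtt"])
print(" ".join(command))
```

On a machine with PyTorch installed, torch.cuda.device_count() can supply num_gpus instead of a hard-coded 1.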

@Davidyao99
Author

Davidyao99 commented Jun 3, 2022

Thank you for responding! I ran the command with --nproc_per_node=1 and received the following error:

Traceback (most recent call last):
  File "main_task_caption.py", line 24, in <module>
    torch.distributed.init_process_group(backend="nccl")
  File "/mnt/c/users/dyao/documents/research/UniVL/univl/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/mnt/c/users/dyao/documents/research/UniVL/univl/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/mnt/c/users/dyao/documents/research/UniVL/univl/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/mnt/c/users/dyao/documents/research/UniVL/univl/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/mnt/c/users/dyao/documents/research/UniVL/univl/bin/python', '-u', 'main_task_caption.py', '--local_rank=0', '--do_train', '--num_thread_reader=4', '--epochs=5', '--batch_size=128', '--n_display=100', '--train_csv', 'data/msrvtt/MSRVTT_train.9k.csv', '--val_csv', 'data/msrvtt/MSRVTT_JSFUSION_test.csv', '--data_path', 'data/msrvtt/MSRVTT_data.json', '--features_path', 'data/msrvtt/msrvtt_videos_features.pickle', '--output_dir', '/ckpt_msrvtt_caption', '--bert_model', 'bert-base-uncased', '--do_lower_case', '--lr', '3e-5', '--max_words', '48', '--max_frames', '48', '--batch_size_val', '32', '--visual_num_hidden_layers', '6', '--decoder_num_hidden_layers', '3', '--datatype', 'msrvtt', '--stage_two', '--init_model', 'weight/univl.pretrained.bin']' returned non-zero exit status 1.

Thank you for your time and help! Sorry, I am not familiar with PyTorch distributed training.
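
The traceback above fails inside torch.distributed.init_process_group(backend="nccl"). Below is a sketch of two things that might help narrow this down, under stated assumptions: setting the NCCL_DEBUG environment variable so that NCCL prints more detail, and choosing the backend at runtime so that gloo can be tried in place of nccl. The helper choose_backend and its override parameter are hypothetical, and whether the rest of the UniVL training code runs correctly with gloo has not been verified here.

```python
import os


def choose_backend(cuda_available, nccl_available, override=None):
    """Pick a torch.distributed backend name.

    Hypothetical helper: returns the explicit override if one is given,
    otherwise "nccl" when both CUDA and NCCL are usable, else "gloo".
    """
    if override:
        return override
    return "nccl" if (cuda_available and nccl_available) else "gloo"


# Ask NCCL for verbose logs unless the user has already set a level.
os.environ.setdefault("NCCL_DEBUG", "INFO")

# Assumed usage inside the training script (not executed here):
#   import torch
#   backend = choose_backend(torch.cuda.is_available(),
#                            torch.distributed.is_nccl_available(),
#                            override="gloo")  # force gloo to rule NCCL in or out
#   torch.distributed.init_process_group(backend=backend)
print(choose_backend(True, True, override="gloo"))
```

If the run succeeds with gloo but not with nccl, that points at the NCCL setup on the machine rather than at the launch command.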

@ArrowLuo
Contributor

ArrowLuo commented Jun 4, 2022

Hi @Davidyao99, what is your whole command?
