
[Bug] math_gen dataset evaluation fails randomly #1148

Open
2 tasks done
berton820 opened this issue May 14, 2024 · 4 comments
@berton820

Prerequisites

Type

I am evaluating with officially supported tasks/models/datasets.

Environment

{'CUDA available': True,
'CUDA_HOME': '/usr/local/cuda',
'GCC': 'gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0',
'GPU 0,1,2,3,4,5,6,7': 'NVIDIA A800-SXM4-80GB',
'MMEngine': '0.10.4',
'MUSA available': False,
'NVCC': 'Cuda compilation tools, release 11.7, V11.7.64',
'OpenCV': '4.9.0',
'PyTorch': '2.3.0+cu121',
'PyTorch compiling details': 'PyTorch built with:\n'
' - GCC 9.3\n'
' - C++ Version: 201703\n'
' - Intel(R) oneAPI Math Kernel Library Version '
'2022.2-Product Build 20220804 for Intel(R) 64 '
'architecture applications\n'
' - Intel(R) MKL-DNN v3.3.6 (Git Hash '
'86e6af5974177e513fd3fee58425e1063e7f1361)\n'
' - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
' - LAPACK is enabled (usually provided by '
'MKL)\n'
' - NNPACK is enabled\n'
' - CPU capability usage: AVX512\n'
' - CUDA Runtime 12.1\n'
' - NVCC architecture flags: '
'-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90\n'
' - CuDNN 8.9.2\n'
' - Magma 2.6.1\n'
' - Build settings: BLAS_INFO=mkl, '
'BUILD_TYPE=Release, CUDA_VERSION=12.1, '
'CUDNN_VERSION=8.9.2, '
'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 '
'-fabi-version=11 -fvisibility-inlines-hidden '
'-DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO '
'-DLIBKINETO_NOROCTRACER -DUSE_FBGEMM '
'-DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK '
'-DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE '
'-O2 -fPIC -Wall -Wextra -Werror=return-type '
'-Werror=non-virtual-dtor -Werror=bool-operation '
'-Wnarrowing -Wno-missing-field-initializers '
'-Wno-type-limits -Wno-array-bounds '
'-Wno-unknown-pragmas -Wno-unused-parameter '
'-Wno-unused-function -Wno-unused-result '
'-Wno-strict-overflow -Wno-strict-aliasing '
'-Wno-stringop-overflow -Wsuggest-override '
'-Wno-psabi -Wno-error=pedantic '
'-Wno-error=old-style-cast -Wno-missing-braces '
'-fdiagnostics-color=always -faligned-new '
'-Wno-unused-but-set-variable '
'-Wno-maybe-uninitialized -fno-math-errno '
'-fno-trapping-math -Werror=format '
'-Wno-stringop-overflow, LAPACK_INFO=mkl, '
'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
'PERF_WITH_AVX512=1, TORCH_VERSION=2.3.0, '
'USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, '
'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, '
'USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, '
'USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, '
'USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, '
'USE_ROCM_KERNEL_ASSERT=OFF, \n',
'Python': '3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]',
'TorchVision': '0.18.0+cu121',
'numpy_random_seed': 2147483648,
'opencompass': '0.2.4+',
'sys.platform': 'linux'}

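Reproduces the problem - code/configuration sample

Running the math_gen dataset with the official qwen1.5-1.8B model.

Reproduces the problem - command or script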

CUDA_VISIBLE_DEVICES=0 python run.py \
    --datasets math_gen \
    --hf-path local_Qwen1.5 \
    --tokenizer-path local_Qwen1.5 \
    --work-dir ./outputs/ \
    --model-kwargs device_map='auto' \
    --tokenizer-kwargs padding_side='left' truncation='left' use_fast=False \
    --max-out-len 100 \
    --max-seq-len 2048 \
    --batch-size 8 \
    --no-batch-padding \
    --num-gpus 1

Reproduces the problem - error message

opencompass/opencompass/runners/base.py - summarize - 64 - OpenICLInfer[opencompass.models.huggingface.HuggingFace_download_Qwen1.5-1.8B/math_24] failed with code 1

opencompass/opencompass/tasks/openicl_eval.py - _score - 239 - Task [opencompass.models.huggingface.HuggingFace_download_Qwen1.5-1.8B/math]: preds and refrs have different length


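Other information

I am using opencompass to evaluate the official, unmodified qwen1.5 model. The math dataset is split into several chunks, and on each run a different chunk fails with this error: this time math_24 failed, and on earlier runs math_6 and others failed.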

@liushz
Collaborator

liushz commented May 14, 2024

Have you changed your partition logic midway? If you run it all at once, this problem shouldn't occur

@berton820
Author

> Have you changed your partition logic midway? If you run it all at once, this problem shouldn't occur

  1. I have not made any changes.
  2. To avoid unexpected bugs, I also removed ~/.cache before running the script.

@liushz
Collaborator

liushz commented May 14, 2024

The error in your eval stage is caused by errors during your infer stage: because inference failed, the number of predictions differs from the number of references. You can check the following log:
[image: screenshot]
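
The mismatch described above can be illustrated with a small check: if inference stops partway through a chunk, fewer predictions are written than there are references, and a length comparison fails. This is a minimal sketch with a hypothetical helper name (`check_lengths` is not an opencompass function), and it does not reproduce the actual check in `openicl_eval.py`.

```python
def check_lengths(preds, refs):
    """Return None if the lengths match, else a short message describing the gap."""
    if len(preds) == len(refs):
        return None
    return (
        "preds and refs have different length: "
        f"{len(preds)} predictions vs {len(refs)} references"
    )
```

A non-None result for a chunk points back at the infer stage for that chunk rather than at the evaluator itself.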

@berton820
Author

> The error in your eval stage is caused by errors during your infer stage: because inference failed, the number of predictions differs from the number of references. You can check the following log:

Hi liushz,
the log is here, but I cannot see what the problem is.
[image: log screenshot]


W0513 14:40:06.153000 139835420161856 torch/distributed/elastic/agent/server/api.py:741] Received Signals.SIGHUP death signal, shutting down workers
W0513 14:40:06.154000 139835420161856 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2678994 closing signal SIGHUP
Traceback (most recent call last):
  File "/home/jovyan/anaconda3/envs/opencompass/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/jovyan/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/jovyan/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/jovyan/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/jovyan/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/jovyan/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 254, in launch_agent
    result = agent.run()
  File "/home/jovyan/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/home/jovyan/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 733, in run
    result = self._invoke_run(role)
  File "/home/jovyan/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 876, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/jovyan/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 76, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 2678928 got signal: 1
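
The last line of the traceback reports only a signal number. As a reading aid, the sketch below maps that number to its symbolic name with Python's standard `signal` module; on Linux, signal 1 is SIGHUP, which matches the "Received Signals.SIGHUP death signal" warning at the top of the log. The helper name `signal_name_from_log` is hypothetical.

```python
import re
import signal


def signal_name_from_log(line):
    """Extract the number after 'got signal:' and map it to its symbolic name.

    Returns None when the line contains no 'got signal: <n>' fragment.
    """
    match = re.search(r"got signal: (\d+)", line)
    if match is None:
        return None
    # signal.Signals is an enum of the signals defined on the current platform.
    return signal.Signals(int(match.group(1))).name
```

SIGHUP is typically delivered when the controlling terminal or SSH session closes, so one thing worth checking (an assumption, not something confirmed in this thread) is whether the session that launched the run stayed open for the whole evaluation.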
