[Bug] Evaluation on the math_gen dataset fails randomly #1148

berton820 · 2024-05-14T09:12:21Z

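Prerequisites

I have searched the issues and discussions but did not get the help I expected.
The bug has not been fixed in the latest version.

Type

I am evaluating with officially supported tasks/models/datasets.

Environment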

{'CUDA available': True,
'CUDA_HOME': '/usr/local/cuda',
'GCC': 'gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0',
'GPU 0,1,2,3,4,5,6,7': 'NVIDIA A800-SXM4-80GB',
'MMEngine': '0.10.4',
'MUSA available': False,
'NVCC': 'Cuda compilation tools, release 11.7, V11.7.64',
'OpenCV': '4.9.0',
'PyTorch': '2.3.0+cu121',
'PyTorch compiling details': 'PyTorch built with:\n'
' - GCC 9.3\n'
' - C++ Version: 201703\n'
' - Intel(R) oneAPI Math Kernel Library Version '
'2022.2-Product Build 20220804 for Intel(R) 64 '
'architecture applications\n'
' - Intel(R) MKL-DNN v3.3.6 (Git Hash '
'86e6af5974177e513fd3fee58425e1063e7f1361)\n'
' - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
' - LAPACK is enabled (usually provided by '
'MKL)\n'
' - NNPACK is enabled\n'
' - CPU capability usage: AVX512\n'
' - CUDA Runtime 12.1\n'
' - NVCC architecture flags: '
'-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90\n'
' - CuDNN 8.9.2\n'
' - Magma 2.6.1\n'
' - Build settings: BLAS_INFO=mkl, '
'BUILD_TYPE=Release, CUDA_VERSION=12.1, '
'CUDNN_VERSION=8.9.2, '
'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 '
'-fabi-version=11 -fvisibility-inlines-hidden '
'-DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO '
'-DLIBKINETO_NOROCTRACER -DUSE_FBGEMM '
'-DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK '
'-DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE '
'-O2 -fPIC -Wall -Wextra -Werror=return-type '
'-Werror=non-virtual-dtor -Werror=bool-operation '
'-Wnarrowing -Wno-missing-field-initializers '
'-Wno-type-limits -Wno-array-bounds '
'-Wno-unknown-pragmas -Wno-unused-parameter '
'-Wno-unused-function -Wno-unused-result '
'-Wno-strict-overflow -Wno-strict-aliasing '
'-Wno-stringop-overflow -Wsuggest-override '
'-Wno-psabi -Wno-error=pedantic '
'-Wno-error=old-style-cast -Wno-missing-braces '
'-fdiagnostics-color=always -faligned-new '
'-Wno-unused-but-set-variable '
'-Wno-maybe-uninitialized -fno-math-errno '
'-fno-trapping-math -Werror=format '
'-Wno-stringop-overflow, LAPACK_INFO=mkl, '
'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
'PERF_WITH_AVX512=1, TORCH_VERSION=2.3.0, '
'USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, '
'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, '
'USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, '
'USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, '
'USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, '
'USE_ROCM_KERNEL_ASSERT=OFF, \n',
'Python': '3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]',
'TorchVision': '0.18.0+cu121',
'numpy_random_seed': 2147483648,
'opencompass': '0.2.4+',
'sys.platform': 'linux'}

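Reproduces the problem - code/configuration sample

Running the official qwen1.5-1.8B model on the math_gen dataset.

Reproduces the problem - command or script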

CUDA_VISIBLE_DEVICES=0 python run.py \
    --datasets math_gen \
    --hf-path local_Qwen1.5 \
    --tokenizer-path local_Qwen1.5 \
    --work-dir ./outputs/ \
    --model-kwargs device_map='auto' \
    --tokenizer-kwargs padding_side='left' truncation='left' use_fast=False \
    --max-out-len 100 \
    --max-seq-len 2048 \
    --batch-size 8 \
    --no-batch-padding \
    --num-gpus 1

Reproduces the problem - error message

opencompass/opencompass/runners/base.py - summarize - 64 - OpenICLInfer[opencompass.models.huggingface.HuggingFace_download_Qwen1.5-1.8B/math_24] failed with code 1

opencompass/opencompass/tasks/openicl_eval.py - _score - 239 - Task [opencompass.models.huggingface.HuggingFace_download_Qwen1.5-1.8B/math]: preds and refrs have different length

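Other information

I am evaluating the official, unmodified qwen1.5 model with opencompass. The math dataset is split into several chunks, and on every run a different chunk fails with the error shown above: this time it was math_24, and earlier runs failed on math_6, among others.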
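Because a different chunk fails on each run, it can help to list which chunks have no usable prediction file before the eval stage compares predictions with references. The following is a minimal sketch, not OpenCompass code: the helper name `find_missing_shards` is hypothetical, and it assumes each chunk writes one JSON file named `<dataset>_<index>.json` (inferred from task names such as math_24 and math_6) into a single predictions directory.

```python
import json
from pathlib import Path


def find_missing_shards(pred_dir, dataset_abbr, num_shards):
    """Return the indices of shards without a readable, non-empty JSON file.

    Assumption: shard i is stored as <pred_dir>/<dataset_abbr>_<i>.json.
    """
    pred_dir = Path(pred_dir)
    missing = []
    for idx in range(num_shards):
        path = pred_dir / f"{dataset_abbr}_{idx}.json"
        try:
            with path.open(encoding="utf-8") as f:
                data = json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            # File absent or truncated: the infer task for this shard
            # most likely did not finish.
            missing.append(idx)
            continue
        if not data:
            # File exists but holds no predictions.
            missing.append(idx)
    return missing
```

A call would look like `find_missing_shards(pred_dir, "math", num_shards)`, where `pred_dir` and `num_shards` are placeholders for the predictions directory under the work dir and the number of chunks in the run. Any index it returns points to an infer task whose log is worth reading first.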

liushz · 2024-05-14T09:18:55Z

Have you changed your partition logic midway? If you run it all at once, this problem shouldn't occur.

berton820 · 2024-05-14T09:35:31Z

> Have you changed your partition logic midway? If you run it all at once, this problem shouldn't occur

I have not made any changes.
To avoid unexpected bugs, I also removed ~/.cache before running the script.

liushz · 2024-05-14T09:57:35Z

The error in your eval stage appears because some errors occurred during your infer stage, so the number of predictions differs from the number of references. You can check the following log:

berton820 · 2024-05-14T10:22:28Z

> The error log for your eval stage is because there are some errors during your infer stage, so the length of prediction is different with refs, you can check the following log:

Hi liushz,
The log is below, but I cannot see what the problem is.

W0513 14:40:06.153000 139835420161856 torch/distributed/elastic/agent/server/api.py:741] Received Signals.SIGHUP death signal, shutting down workers
W0513 14:40:06.154000 139835420161856 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2678994 closing signal SIGHUP
Traceback (most recent call last):
  File "/home/jovyan/anaconda3/envs/opencompass/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/jovyan/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/jovyan/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/jovyan/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/jovyan/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/jovyan/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 254, in launch_agent
    result = agent.run()
  File "/home/jovyan/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/home/jovyan/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 733, in run
    result = self._invoke_run(role)
  File "/home/jovyan/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 876, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/jovyan/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 76, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 2678928 got signal: 1
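
The last line of the log reports only a numeric signal code. As a minimal sketch for reading it, the snippet below maps the number in a `got signal: N` message to its symbolic name with Python's standard `signal` module. The helper name `decode_signal_line` is hypothetical, and the mapping assumes a POSIX system such as the Linux host described in the environment section.

```python
import re
import signal


def decode_signal_line(line: str) -> str:
    """Return the symbolic name of the signal number in a 'got signal: N' message."""
    match = re.search(r"got signal: (\d+)", line)
    if match is None:
        raise ValueError("no 'got signal: N' fragment in line")
    # signal.Signals maps the platform's signal numbers to enum members.
    return signal.Signals(int(match.group(1))).name


last_line = (
    "torch.distributed.elastic.multiprocessing.api.SignalException: "
    "Process 2678928 got signal: 1"
)
print(decode_signal_line(last_line))  # SIGHUP on Linux
```

This agrees with the two warning lines at the top of the log, which name SIGHUP explicitly: the torchrun launcher was told to hang up and shut its workers down, rather than the model code raising an exception. SIGHUP is commonly delivered when the controlling terminal or SSH session closes; whether that is what happened here is not established in this thread.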

mm-assistant bot assigned bittersweet1999 May 14, 2024
