How to solve the following problem with xtuner on Slurm? #667

Open
har77774 opened this issue May 9, 2024 · 1 comment

Comments


har77774 commented May 9, 2024

Traceback (most recent call last):
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/filelock/_unix.py", line 43, in _acquire
    fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
OSError: [Errno 38] Function not implemented

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in <module>
    main()
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/xtuner/tools/train.py", line 338, in main
    runner.train()
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/mmengine/runner/runner.py", line 1728, in train
    self._train_loop = self.build_train_loop(
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/mmengine/runner/runner.py", line 1520, in build_train_loop
    loop = LOOPS.build(
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in __init__
    dataloader = runner.build_dataloader(
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/mmengine/runner/runner.py", line 1370, in build_dataloader
    dataset = DATASETS.build(dataset_cfg)
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 298, in process_hf_dataset
    return process(**kwargs)
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 167, in process
    dataset = build_origin_dataset(dataset, split)
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 30, in build_origin_dataset
    dataset = BUILDER.build(dataset)
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/datasets/load.py", line 2556, in load_dataset
    builder_instance = load_dataset_builder(
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/datasets/load.py", line 2265, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/datasets/builder.py", line 418, in __init__
    with FileLock(lock_path):
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/filelock/_api.py", line 297, in __enter__
    self.acquire()
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/filelock/_api.py", line 255, in acquire
    self._acquire()
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/filelock/_unix.py", line 48, in _acquire
    raise NotImplementedError(msg) from exception
NotImplementedError: FileSystem does not appear to support flock; user SoftFileLock instead
Exception ignored in atexit callback: <function matmul_ext_update_autotune_table at 0x7f4f48910790>
Traceback (most recent call last):
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 444, in matmul_ext_update_autotune_table
    fp16_matmul._update_autotune_table()
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 421, in _update_autotune_table
    TritonMatmul._update_autotune_table(__class__.__name__ + "_2d_kernel", __class__._2d_kernel)
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 150, in _update_autotune_table
    cache_manager.put(autotune_table)
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 66, in put
    with FileLock(self.lock_path):
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/filelock/_api.py", line 297, in __enter__
    self.acquire()
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/filelock/_api.py", line 255, in acquire
    self._acquire()
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/filelock/_unix.py", line 48, in _acquire
    raise NotImplementedError(msg) from exception
NotImplementedError: FileSystem does not appear to support flock; user SoftFileLock instead
Exception ignored in atexit callback: <function matmul_ext_update_autotune_table at 0x7f43efc79750>
Traceback (most recent call last):
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 444, in matmul_ext_update_autotune_table
    fp16_matmul._update_autotune_table()
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 421, in _update_autotune_table
    TritonMatmul._update_autotune_table(__class__.__name__ + "_2d_kernel", __class__._2d_kernel)
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 150, in _update_autotune_table
    cache_manager.put(autotune_table)
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 66, in put
    with FileLock(self.lock_path):
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/filelock/_api.py", line 297, in __enter__
    self.acquire()
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/filelock/_api.py", line 255, in acquire
    self._acquire()
  File "/public/home/xxx/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/filelock/_unix.py", line 48, in _acquire
    raise NotImplementedError(msg) from exception
NotImplementedError: FileSystem does not appear to support flock; user SoftFileLock instead.

"It looks like the problem you are running into is that fcntl.flock is not supported by your file system." This explanation comes from GPT4.

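One way to confirm that diagnosis is to try the same `fcntl.flock` call that fails in the traceback on a scratch file in the directory in question. The following is a minimal sketch, not part of xtuner or datasets; the helper name `supports_flock` is made up for illustration, and it only treats errno 38 (`ENOSYS`, as in the traceback) as "not supported".

```python
import errno
import fcntl
import os
import tempfile


def supports_flock(directory: str) -> bool:
    """Return False if flock() on a file in `directory` fails with ENOSYS (errno 38)."""
    # Hypothetical helper: creates a scratch file, tries to lock it, then cleans up.
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        # Same call that raises in filelock/_unix.py in the traceback above.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        fcntl.flock(fd, fcntl.LOCK_UN)
        return True
    except OSError as exc:
        if exc.errno == errno.ENOSYS:
            return False
        raise  # any other error is unrelated to missing flock support
    finally:
        os.close(fd)
        os.unlink(path)


# Check the system temp directory; replace with the cache directory you care about.
print(supports_flock(tempfile.gettempdir()))
```

Running it against the shared home directory and against a node-local directory such as the system temp directory should show which of the two lacks flock support.
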
@har77774 har77774 changed the title How to solve the following problem with xtuner on Slurn? How to solve the following problem with xtuner on Slurm? May 9, 2024
Collaborator

LZHgrla commented May 10, 2024

@har77774 Hi!
This is a known problem in the datasets library; see the related issues huggingface/datasets#6505 and huggingface/datasets#2618.

The workaround described in this comment, huggingface/datasets#6505 (comment), seems to solve the issue.
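
The linked comment is not reproduced here. As a separate, hedged sketch of one possible workaround: the failing lock in the traceback is created by datasets inside its cache directory, so pointing that cache at a file system that supports flock may avoid the error. The snippet below assumes that a node-local temp directory supports flock and uses the `HF_DATASETS_CACHE` environment variable; the helper name and the `hf_datasets_cache` folder name are made up for illustration.

```python
import os
import tempfile


def use_local_datasets_cache(base_dir=None):
    """Point the Hugging Face datasets cache at a (presumably node-local) directory."""
    # Assumption: base_dir lives on a file system that supports flock.
    base_dir = base_dir or tempfile.gettempdir()
    cache_dir = os.path.join(base_dir, "hf_datasets_cache")  # hypothetical folder name
    os.makedirs(cache_dir, exist_ok=True)
    os.environ["HF_DATASETS_CACHE"] = cache_dir
    return cache_dir


cache_dir = use_local_datasets_cache()
print(cache_dir)
# Import datasets (or launch xtuner) only after the variable is set,
# because the cache location is read when the library is imported.
```

On a multi-node Slurm job each node would then build its own cache, which costs extra preprocessing time. This sketch does not address the DeepSpeed `atexit` traceback, which uses a separate lock file.
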
