
[DeviceMesh] Supported N groups in from_group #126258

Closed
awgu wants to merge 9 commits

Conversation

awgu
Contributor

@awgu awgu commented May 15, 2024

Stack from ghstack (oldest at bottom):

Overview
This PR supports constructing an ND mesh with `from_group()` by passing in `group: List[ProcessGroup]` and `mesh: Union[torch.Tensor, "ArrayLike"]` together. The `ndim` of the device mesh returned from `from_group()` is equal to the number of `ProcessGroup`s passed. If the `ndim` is greater than 1, then the `mesh` argument is required, since there is no simple way to recover the `mesh` tensor from the process groups otherwise.

This PR also adds `mesh_dim_names` as an argument that is forwarded to the device mesh for convenience.
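For example, a minimal sketch of the new API for the HSDP case; the 2 × 4 (replicate × shard) layout on 8 ranks, the NCCL backend, and the dimension names are illustrative assumptions, not code from this PR:

```python
# Hedged sketch: build one ProcessGroup per mesh dim for a 2 x 4 HSDP
# layout on 8 ranks, then construct the 2D DeviceMesh via from_group().
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import DeviceMesh

dist.init_process_group("nccl")  # e.g. launched via torchrun with 8 ranks
rank = dist.get_rank()

# mesh[i, j] is the global rank at (replicate=i, shard=j)
mesh = torch.arange(8).view(2, 4)

replicate_group = shard_group = None
for row in mesh.tolist():  # each row is one shard group
    pg = dist.new_group(row)  # every rank must call new_group for every group
    if rank in row:
        shard_group = pg
for col in mesh.t().tolist():  # each column is one replicate group
    pg = dist.new_group(col)
    if rank in col:
        replicate_group = pg

# One ProcessGroup per mesh dim; since ndim > 1, mesh is required.
device_mesh = DeviceMesh.from_group(
    [replicate_group, shard_group],
    device_type="cuda",
    mesh=mesh,
    mesh_dim_names=("replicate", "shard"),
)
assert device_mesh.ndim == 2  # ndim == number of groups passed
```

A named 1D submesh can then be sliced out by name, e.g. `device_mesh["shard"]`.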

Old Approach

Overview

  • This PR mainly adds mesh_shape to from_group() so that the user can construct an ND (N > 1) device mesh from a process group. This is to unblock HSDP, where we can pass the overall data parallel process group to from_group() with mesh_shape = (replicate_dim_size, shard_dim_size) and from_group() will construct subgroups for the user. (The user can then get the subgroups from the submeshes.)
    • Constructing the 2D DeviceMesh from an existing shard process group and replicate process group is hard because we cannot easily recover the array of ranks in their parent group on each rank in general.
  • This PR also adds mesh_dim_names to from_group() so that the user can name the mesh dimensions of the constructed device mesh.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k


pytorch-bot bot commented May 15, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126258

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Unrelated Failure

As of commit d757bd4 with merge base 91bf952:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (fsdp) labels May 15, 2024
awgu added a commit that referenced this pull request May 15, 2024
ghstack-source-id: 7b4fd9ebe36785533e60d1bbfea31fdc1b7479af
Pull Request resolved: #126258
…group`"


- This PR mainly adds `mesh_shape` to `from_group()` so that the user can construct an ND (N > 1) device mesh from a process group. This is to unblock HSDP, where we can pass the overall data parallel process group to `from_group()` with `mesh_shape = (replicate_dim_size, shard_dim_size)` and `from_group()` will construct subgroups for the user.
    - Constructing the 2D `DeviceMesh` from an existing shard process group and replicate process group is hard because we cannot easily recover the array of ranks in their parent group on each rank in general.
- This PR also adds `mesh_dims` to `from_group()` so that the user can name the mesh dimensions of the constructed device mesh.

cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k

[ghstack-poisoned]
awgu added a commit that referenced this pull request May 15, 2024
ghstack-source-id: 9b0f4ed648a6ffad4414c036ed05ea1099f362e4
Pull Request resolved: #126258
…group`"


- This PR mainly adds `mesh_shape` to `from_group()` so that the user can construct an ND (N > 1) device mesh from a process group. This is to unblock HSDP, where we can pass the overall data parallel process group to `from_group()` with `mesh_shape = (replicate_dim_size, shard_dim_size)` and `from_group()` will construct subgroups for the user.
    - Constructing the 2D `DeviceMesh` from an existing shard process group and replicate process group is hard because we cannot easily recover the array of ranks in their parent group on each rank in general.
- This PR also adds `mesh_dims` to `from_group()` so that the user can name the mesh dimensions of the constructed device mesh.

cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k

[ghstack-poisoned]
awgu added a commit that referenced this pull request May 15, 2024
ghstack-source-id: d69f7f205f9c189409a8c76ae3d29b72aa11130d
Pull Request resolved: #126258
@awgu awgu requested review from wconstab, wanchaol and wz337 May 15, 2024 14:56
@awgu awgu marked this pull request as ready for review May 15, 2024 14:56
@awgu awgu requested a review from donglimm May 15, 2024 15:18
…group`"


**Overview**
- This PR mainly adds `mesh_shape` to `from_group()` so that the user can construct an ND (N > 1) device mesh from a process group. This is to unblock HSDP, where we can pass the overall data parallel process group to `from_group()` with `mesh_shape = (replicate_dim_size, shard_dim_size)` and `from_group()` will construct subgroups for the user. (The user can then get the subgroups from the submeshes.)
    - Constructing the 2D `DeviceMesh` from an existing shard process group and replicate process group is hard because we cannot easily recover the array of ranks in their parent group on each rank in general.
- This PR also adds `mesh_dim_names` to `from_group()` so that the user can name the mesh dimensions of the constructed device mesh.



cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k

[ghstack-poisoned]
awgu added a commit that referenced this pull request May 15, 2024
ghstack-source-id: 29f8816f6a80183c441b019784918b967fef187e
Pull Request resolved: #126258
Contributor

@wanchaol wanchaol left a comment


looks awesome!

Two resolved review threads on torch/distributed/device_mesh.py (outdated)
…group`"


**Overview**
- This PR mainly adds `mesh_shape` to `from_group()` so that the user can construct an ND (N > 1) device mesh from a process group. This is to unblock HSDP, where we can pass the overall data parallel process group to `from_group()` with `mesh_shape = (replicate_dim_size, shard_dim_size)` and `from_group()` will construct subgroups for the user. (The user can then get the subgroups from the submeshes.)
    - Constructing the 2D `DeviceMesh` from an existing shard process group and replicate process group is hard because we cannot easily recover the array of ranks in their parent group on each rank in general.
- This PR also adds `mesh_dim_names` to `from_group()` so that the user can name the mesh dimensions of the constructed device mesh.



cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k

[ghstack-poisoned]
awgu added a commit that referenced this pull request May 15, 2024
ghstack-source-id: 1a9b841498bd94cc25498358777dd1a9b7cba0f3
Pull Request resolved: #126258
…group`"


**Overview**
- This PR mainly adds `mesh_shape` to `from_group()` so that the user can construct an ND (N > 1) device mesh from a process group. This is to unblock HSDP, where we can pass the overall data parallel process group to `from_group()` with `mesh_shape = (replicate_dim_size, shard_dim_size)` and `from_group()` will construct subgroups for the user. (The user can then get the subgroups from the submeshes.)
    - Constructing the 2D `DeviceMesh` from an existing shard process group and replicate process group is hard because we cannot easily recover the array of ranks in their parent group on each rank in general.
- This PR also adds `mesh_dim_names` to `from_group()` so that the user can name the mesh dimensions of the constructed device mesh.



cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k

[ghstack-poisoned]
awgu added a commit that referenced this pull request May 15, 2024
ghstack-source-id: c74b058293bd07de7d1150780e7a6d31388870a0
Pull Request resolved: #126258

# Use the global PG as the parent group (in practice, this could be a
# subgroup of the global PG)
dp_group = dist.distributed_c10d._get_default_group()
awgu (Contributor Author) commented:

The assumption is that the trainer is still able to construct this overall DP process group. Then, constructing the mesh arg from there can be done as in the sketch below.
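A hedged sketch of that construction, assuming the trainer knows the intended `(replicate_size, shard_size)` factorization:

```python
# Sketch: recover the DP group's global ranks and reshape them into the
# (replicate, shard) layout expected as from_group's mesh argument.
# replicate_size = 2 is an illustrative assumption.
import torch
import torch.distributed as dist

dp_group = dist.distributed_c10d._get_default_group()
replicate_size = 2
shard_size = dist.get_world_size(dp_group) // replicate_size

dp_ranks = dist.get_process_group_ranks(dp_group)
mesh = torch.tensor(dp_ranks).view(replicate_size, shard_size)
```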

…group`"


**Overview**
This PR supports constructing an ND mesh with `from_group()` by passing in `group: List[ProcessGroup]` and `mesh: Union[torch.Tensor, "ArrayLike"]` together. The `ndim` of the device mesh returned from `from_group()` is equal to the number of `ProcessGroup`s passed. If the `ndim` is greater than 1, then the `mesh` argument is required (since there is no simple way to recover the `mesh` tensor from the process groups otherwise).

This PR also adds `mesh_dim_names` as an argument to forward to the device mesh for convenience.

<details>
<summary> Old Approach </summary>

**Overview**
- This PR mainly adds `mesh_shape` to `from_group()` so that the user can construct an ND (N > 1) device mesh from a process group. This is to unblock HSDP, where we can pass the overall data parallel process group to `from_group()` with `mesh_shape = (replicate_dim_size, shard_dim_size)` and `from_group()` will construct subgroups for the user. (The user can then get the subgroups from the submeshes.)
    - Constructing the 2D `DeviceMesh` from an existing shard process group and replicate process group is hard because we cannot easily recover the array of ranks in their parent group on each rank in general.
- This PR also adds `mesh_dim_names` to `from_group()` so that the user can name the mesh dimensions of the constructed device mesh.

</details>

cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k

[ghstack-poisoned]
awgu added a commit that referenced this pull request May 15, 2024
ghstack-source-id: 902775a48b7350058ea477652677f2a9e6154082
Pull Request resolved: #126258
@awgu awgu changed the title from "[DeviceMesh] Added mesh_shape/mesh_dim_names to from_group" to "[DeviceMesh] Supported N groups in from_group" May 15, 2024
@awgu awgu requested a review from wanchaol May 15, 2024 21:04
Contributor

@wanchaol wanchaol left a comment


The new approach sounds good to me!

@awgu awgu added the ciflow/periodic label May 15, 2024
@wanchaol wanchaol added the ciflow/trunk label May 15, 2024
@awgu
Contributor Author

awgu commented May 16, 2024

NotImplementedError: Could not run '_c10d_functional_autograd::all_to_all_single' with arguments from the 'CUDA' backend.

periodic / linux-focal-rocm6.1-py3.8 / test (distributed, 1, 2, linux.rocm.gpu) failure is known and unrelated. #126380

@awgu
Contributor Author

awgu commented May 16, 2024

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 2 checks: periodic / linux-focal-cuda11.8-py3.9-gcc9 / test (multigpu, 1, 1, linux.g5.12xlarge.nvidia.gpu), periodic / linux-focal-rocm6.1-py3.8 / test (distributed, 1, 2, linux.rocm.gpu)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


@pytorchmergebot
Collaborator

Merge failed

Reason: Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x bb745634b2752c3c27c747fb74736ae13d23b6b4 returned non-zero exit code 1

Auto-merging test/distributed/_composable/fsdp/test_fully_shard_init.py
CONFLICT (content): Merge conflict in test/distributed/_composable/fsdp/test_fully_shard_init.py
Auto-merging test/distributed/test_device_mesh.py
CONFLICT (content): Merge conflict in test/distributed/test_device_mesh.py
Auto-merging torch/distributed/device_mesh.py
error: could not apply bb745634b27... [DeviceMesh] Added `mesh`/`mesh_dim_names` to `from_group`
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".

awgu added a commit that referenced this pull request May 16, 2024
ghstack-source-id: bb94cd7d13d4c2c4fea0033710dacd51dbc341e4
Pull Request resolved: #126258
awgu added a commit that referenced this pull request May 16, 2024
ghstack-source-id: e4587a5d259fd1be4af7f566e5ddc0cb6fd12e97
Pull Request resolved: #126258
@awgu
Contributor Author

awgu commented May 16, 2024

distributed/test_c10d_nccl.py::ProcessGroupNCCLGroupTest::test_nan_assert_float16

linux-focal-cuda11.8-py3.9-gcc9 / test (multigpu, 1, 1, linux.g5.12xlarge.nvidia.gpu) failure unrelated


linux-focal-rocm6.1-py3.8 / build failure unrelated

@awgu
Contributor Author

awgu commented May 16, 2024

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 2 checks: periodic / linux-focal-rocm6.1-py3.8 / build, periodic / linux-focal-cuda11.8-py3.9-gcc9 / test (multigpu, 1, 1, linux.g5.12xlarge.nvidia.gpu)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-focal-rocm6.1-py3.8 / test (default, 2, 2, linux.rocm.gpu)


"""
Contstructs a :class:`DeviceMesh` with ``device_type`` from an
existing :class:`ProcessGroup`.

The constructed device mesh is assumed to be 1D.
The constructed device mesh has number of dimensions equal to the


Is there any requirement on the input process groups?
The test case test_2d_process_group_init first builds a global PG, then builds dp_shard_group and dp_replicate_group based on the global PG. Is it required to always build the global PG first,
or can these two process groups be created independently?

awgu (Contributor Author) replied:

I think the two process groups can be constructed independently.
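For example, a hedged sketch of that independent construction, where each group is enumerated directly from the global ranks with no intermediate combined DP group (the 2 × world/2 layout is an illustrative assumption):

```python
# Sketch: build dp_shard_group and dp_replicate_group independently;
# only the default process group is needed, not a parent DP group.
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
world = dist.get_world_size()
layout = torch.arange(world).view(2, world // 2)  # (replicate, shard)

# Each call enumerates every subgroup on every rank and returns the
# subgroup containing the current rank.
dp_shard_group, _ = dist.new_subgroups_by_enumeration(layout.tolist())
dp_replicate_group, _ = dist.new_subgroups_by_enumeration(layout.t().tolist())
```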

@awgu
Contributor Author

awgu commented May 17, 2024

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 3 checks: trunk / linux-focal-rocm6.1-py3.8 / test (default, 2, 2, linux.rocm.gpu), periodic / linux-focal-rocm6.1-py3.8 / build, periodic / linux-focal-cuda11.8-py3.9-gcc9 / test (multigpu, 1, 1, linux.g5.12xlarge.nvidia.gpu)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


ZelboK pushed a commit to ZelboK/pytorch that referenced this pull request May 19, 2024

Pull Request resolved: pytorch#126258
Approved by: https://github.com/wanchaol
Labels
ciflow/periodic, ciflow/trunk, Merged, oncall: distributed, release notes: DeviceMesh

4 participants