
Add support for DDP and CompiledAutograd #319

Open

fegin wants to merge 5 commits into base: gh/fegin/2/base

Conversation

@fegin (Contributor) commented May 9, 2024

Stack from ghstack (oldest at bottom):

[ghstack-poisoned]
fegin added a commit that referenced this pull request May 9, 2024
ghstack-source-id: b2ebd1394a28bb11756ac4fbe0e23368bf6a1864
Pull Request resolved: #319
@facebook-github-bot added the CLA Signed label May 9, 2024
[ghstack-poisoned]
fegin added a commit that referenced this pull request May 9, 2024
ghstack-source-id: f7a29c2685777e618e00c01397990ff0a8f001df
Pull Request resolved: #319
fegin added a commit that referenced this pull request May 9, 2024
ghstack-source-id: 33a5fdf97d5f0f5063858b1d35f6785ec8a58b13
Pull Request resolved: #319
fegin added a commit that referenced this pull request May 9, 2024
ghstack-source-id: f4b9c10f8dc61f5640176f25213bbcd0fbe6ce97
Pull Request resolved: #319
@fegin requested review from wanchaol and wconstab May 9, 2024 22:10
[ghstack-poisoned]
fegin added a commit that referenced this pull request May 9, 2024
ghstack-source-id: fe96fd5280d334cb9529273fbb2643b691414aa4
Pull Request resolved: #319
help="Data Parallelism degree. -1 means leftover ranks will be used (After SP/PP). 1 means disabled.",
help="Data Parallelism degree (FSDP). -1 means leftover ranks will be used (After SP/PP/replicate). 1 means disabled.",
)
self.parser.add_argument(
Contributor

I have a different suggestion here after some thought:

  • we should keep data_parallel_degree and use it for all data parallel styles
  • we should add a dp_mode training arg that distinguishes whether to apply DDP/FSDP/HSDP, instead of adding data_parallel_replicate_degree
  • dp_degree -> int | tuple[int]; when it is a tuple of ints, it must be HSDP (see the sketch below)
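A minimal sketch of what this could look like, assuming the same argparse-style config as the surrounding diff; the flag names, defaults, and choices here are illustrative assumptions, not the merged API:

    import argparse

    # Hypothetical sketch: keep a single data_parallel_degree shared by all
    # data parallel styles and pick the style with a mode flag, instead of
    # adding a separate data_parallel_replicate_degree.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--training.data_parallel_degree",
        type=int,
        default=-1,
        help="Data Parallelism degree. -1 means leftover ranks will be used (After SP/PP). 1 means disabled.",
    )
    parser.add_argument(
        "--training.data_parallel_mode",
        type=str,
        default="fsdp",
        choices=["ddp", "fsdp", "hsdp"],
        help="Which data parallel style to apply on the data parallel ranks.",
    )

Under this shape, the last bullet's idea could alternatively be expressed by letting the degree accept a pair of ints (replicate, shard), which would imply HSDP.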

Contributor Author

I don't have a strong opinion here; I thought about using mode as well. If that makes sense to people, I can change it to that.

help="Whether to compile the model.",
)
self.parser.add_argument(
"--training.compiled_autograd",
Contributor

This should be added to the experimental space, IMO.
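A hedged sketch of what that could look like, assuming the same argparse pattern and an "experimental" config prefix; the section and flag names are assumptions, not the merged API:

    import argparse

    # Hypothetical sketch: register the toggle under an experimental prefix
    # rather than training, so it is clearly marked as not yet stable.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--experimental.enable_compiled_autograd",
        action="store_true",
        help="Enable CompiledAutograd to compile the autograd graph for the backward pass (experimental).",
    )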

@@ -56,6 +63,10 @@ def build_mesh(self, device_type):
    def dp_enabled(self):
        return self.dp > 1

    @property
    def dp_replicate_enabled(self):
Contributor

Ditto: these comments should be addressed together.
Have dp_mode instead and reuse dp_degree (see the sketch below).
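A minimal sketch of the suggested shape, assuming ParallelDims is a dataclass-like container; the field names, defaults, and mode strings are illustrative assumptions:

    from dataclasses import dataclass

    # Hypothetical sketch: ParallelDims keeps a single dp degree plus a
    # dp_mode string, and replicate/shard checks become derived properties.
    @dataclass
    class ParallelDims:
        dp: int = -1
        dp_mode: str = "fsdp"  # assumed values: "ddp", "fsdp", "hsdp"

        @property
        def dp_enabled(self) -> bool:
            return self.dp > 1

        @property
        def dp_replicate_enabled(self) -> bool:
            # DDP and HSDP both replicate across (some of) the dp ranks.
            return self.dp_enabled and self.dp_mode in ("ddp", "hsdp")

        @property
        def dp_shard_enabled(self) -> bool:
            # FSDP and HSDP both shard parameters across (some of) the dp ranks.
            return self.dp_enabled and self.dp_mode in ("fsdp", "hsdp")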

@@ -0,0 +1,40 @@
# TorchTrain Config.toml
[job]
Contributor

Since there is no official 1B model size for either the Llama 2 or Llama 3 release, and the toml files are user facing, it would be better to only add released model sizes.

@@ -144,6 +195,8 @@ def parallelize_llama(model, world_mesh, parallel_dims, job_config: JobConfig):
            raise NotImplementedError(
                "fused_rmsnorm not yet compatible with TP. Please use layernorm or rmsnorm."
            )
        if parallel_dims.dp_replicate_enabled:
            raise NotImplementedError("DDP/HSDP + TP are not supported yet.")
Contributor

We should make DDP + TP work and see if it can support llama3_8b or llama2_7b. If not, we could try importing a model other than Llama and apply DDP to that model instead :)

Labels: CLA Signed
3 participants