Add support of DDP and CompiledAutograd. #319
base: gh/fegin/2/base
Conversation
help="Data Parallelism degree. -1 means leftover ranks will be used (After SP/PP). 1 means disabled.", | ||
help="Data Parallelism degree (FSDP). -1 means leftover ranks will be used (After SP/PP/replicate). 1 means disabled.", | ||
) | ||
self.parser.add_argument( |
I have a different suggestion here after some thought:
- we should keep data_parallel_degree to be used by all data parallel styles
- we should add a dp_mode training arg that distinguishes whether to apply DDP/FSDP/HSDP, instead of adding data_parallel_replicate_degree
- dp_degree -> int/tuple[int]; when it's a tuple of ints, it must be HSDP
I don't have a strong opinion here. I also thought about using a mode argument. If that makes sense to people, I can change it to that.
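As a rough sketch of what that surface could look like (all flag names and choices here are illustrative, not the repo's final API), assuming an argparse-based config like the existing one:

```python
import argparse

parser = argparse.ArgumentParser()

# Existing-style knob, reused by every data-parallel flavor.
parser.add_argument(
    "--training.data_parallel_degree",
    type=int,
    default=-1,
    help="Data Parallelism degree. -1 means leftover ranks will be used (After SP/PP). 1 means disabled.",
)

# Hypothetical dp_mode knob that selects how the degree is applied.
parser.add_argument(
    "--training.data_parallel_mode",
    type=str,
    choices=["fsdp", "ddp", "hsdp"],
    default="fsdp",
    help="How to apply data parallelism: fully sharded (FSDP), replicated (DDP), or hybrid (HSDP).",
)
```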
help="Whether to compile the model.", | ||
) | ||
self.parser.add_argument( | ||
"--training.compiled_autograd", |
This should be added to the experimental config section, IMO.
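For context, compiled autograd traces the backward pass into a graph that dynamo can compile. A minimal sketch of how such a flag could gate it, assuming the `torch._dynamo.compiled_autograd.enable` context manager (the helper name here is made up for illustration):

```python
import contextlib

import torch


def maybe_compiled_autograd(enabled: bool):
    """Return a context manager that compiles the backward pass when enabled."""
    if not enabled:
        return contextlib.nullcontext()
    # Compile the autograd graph with the same inductor backend used for forward.
    return torch._dynamo.compiled_autograd.enable(
        lambda gm: torch.compile(gm, backend="inductor")
    )


# Usage sketch: only the backward call needs to be inside the context.
# with maybe_compiled_autograd(job_config.training.compiled_autograd):
#     loss.backward()
```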
@@ -56,6 +63,10 @@ def build_mesh(self, device_type):
    def dp_enabled(self):
        return self.dp > 1

    @property
    def dp_replicate_enabled(self):
Ditto: these comments should be addressed together. Have a dp_mode arg instead and reuse dp_degree.
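For reference, a sketch of how the replicate check could sit next to `dp_enabled` on `ParallelDims`; the field names (`dp`, `dp_replicate`) are assumptions inferred from the diff rather than the final API:

```python
from dataclasses import dataclass


@dataclass
class ParallelDims:
    # Assumed fields: sharded and replicated data-parallel degrees.
    dp: int = 1
    dp_replicate: int = 1

    @property
    def dp_enabled(self) -> bool:
        return self.dp > 1

    @property
    def dp_replicate_enabled(self) -> bool:
        return self.dp_replicate > 1
```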
@@ -0,0 +1,40 @@
# TorchTrain Config.toml
[job]
Since there's no official 1B model size in either the Llama 2 or Llama 3 releases, and the toml files are user facing, it would be better if we only add released model sizes.
@@ -144,6 +195,8 @@ def parallelize_llama(model, world_mesh, parallel_dims, job_config: JobConfig):
        raise NotImplementedError(
            "fused_rmsnorm not yet compatible with TP. Please use layernorm or rmsnorm."
        )
    if parallel_dims.dp_replicate_enabled:
        raise NotImplementedError("DDP/HSDP + TP are not supported yet.")
We should make DDP + TP work and see if it can support llama3_8b or llama2_7b. If not, we could try to import other models instead of Llama, and apply DDP to that model instead :)
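As a rough sketch of where plain DDP could be applied once the TP guard above passes, using only the standard `DistributedDataParallel` wrapper (the `apply_ddp` helper and its mesh argument are hypothetical):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP


def apply_ddp(model: torch.nn.Module, dp_mesh) -> torch.nn.Module:
    """Hypothetical helper: replicate the model over the data-parallel mesh.

    Only valid when TP and HSDP are disabled, matching the guard above.
    """
    # A 1-D DeviceMesh exposes its process group via get_group().
    return DDP(model, process_group=dp_mesh.get_group())
```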
Stack from ghstack (oldest at bottom):