Add support of DDP and CompiledAutograd. #319
base: gh/fegin/2/base
Conversation
help="Data Parallelism degree. -1 means leftover ranks will be used (After SP/PP). 1 means disabled.", | ||
help="Data Parallelism degree (FSDP). -1 means leftover ranks will be used (After SP/PP/replicate). 1 means disabled.", | ||
) | ||
self.parser.add_argument( |
I have a different suggestion here after some thought:
- we should keep data_parallel_degree to be used by all data parallel styles
- we should add a dp_mode training arg that distinguishes whether to apply DDP/FSDP/HSDP, instead of adding data_parallel_replicate_degree
- dp_degree -> int/tuple[int]; when it's a tuple of ints, it must be HSDP
I don't have a strong opinion here. I also thought about using a mode argument. If that makes sense to people, I can change it to that.
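As a rough sketch of what that surface could look like (all flag names and choices here are illustrative, not the repo's final API), assuming an argparse-based config like the existing one:

```python
import argparse

parser = argparse.ArgumentParser()

# Existing-style knob, reused by every data-parallel flavor.
parser.add_argument(
    "--training.data_parallel_degree",
    type=int,
    default=-1,
    help="Data Parallelism degree. -1 means leftover ranks will be used (After SP/PP). 1 means disabled.",
)

# Hypothetical dp_mode knob that selects how the degree is applied.
parser.add_argument(
    "--training.data_parallel_mode",
    type=str,
    choices=["fsdp", "ddp", "hsdp"],
    default="fsdp",
    help="How to apply data parallelism: fully sharded (FSDP), replicated (DDP), or hybrid (HSDP).",
)
```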
help="Whether to compile the model.", | ||
) | ||
self.parser.add_argument( | ||
"--training.compiled_autograd", |
This should be added to the experimental config section, IMO.
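For context, compiled autograd traces the backward pass into a graph that dynamo can compile. A minimal sketch of how such a flag could gate it, assuming the `torch._dynamo.compiled_autograd.enable` context manager (the helper name here is made up for illustration):

```python
import contextlib

import torch


def maybe_compiled_autograd(enabled: bool):
    """Return a context manager that compiles the backward pass when enabled."""
    if not enabled:
        return contextlib.nullcontext()
    # Compile the autograd graph with the same inductor backend used for forward.
    return torch._dynamo.compiled_autograd.enable(
        lambda gm: torch.compile(gm, backend="inductor")
    )


# Usage sketch: only the backward call needs to be inside the context.
# with maybe_compiled_autograd(job_config.training.compiled_autograd):
#     loss.backward()
```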
@@ -56,6 +63,10 @@ def build_mesh(self, device_type):
    def dp_enabled(self):
        return self.dp > 1

    @property
    def dp_replicate_enabled(self):
Ditto: these comments should be addressed together. Have a dp_mode arg instead and reuse dp_degree.
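For reference, a sketch of how the replicate check could sit next to `dp_enabled` on `ParallelDims`; the field names (`dp`, `dp_replicate`) are assumptions inferred from the diff rather than the final API:

```python
from dataclasses import dataclass


@dataclass
class ParallelDims:
    # Assumed fields: sharded and replicated data-parallel degrees.
    dp: int = 1
    dp_replicate: int = 1

    @property
    def dp_enabled(self) -> bool:
        return self.dp > 1

    @property
    def dp_replicate_enabled(self) -> bool:
        return self.dp_replicate > 1
```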
@@ -0,0 +1,40 @@
# TorchTrain Config.toml
[job]
Since there's no official 1B model size in either the Llama 2 or Llama 3 releases, and the toml files are user facing, it would be better if we only add released model sizes.
@@ -144,6 +195,8 @@ def parallelize_llama(model, world_mesh, parallel_dims, job_config: JobConfig):
        raise NotImplementedError(
            "fused_rmsnorm not yet compatible with TP. Please use layernorm or rmsnorm."
        )
    if parallel_dims.dp_replicate_enabled:
        raise NotImplementedError("DDP/HSDP + TP are not supported yet.")
We should make DDP + TP work and see if it can support llama3_8b or llama2_7b. If not, we could try to import other models instead of Llama, and apply DDP to that model instead :)
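As a rough sketch of where plain DDP could be applied once the TP guard above passes, using only the standard `DistributedDataParallel` wrapper (the `apply_ddp` helper and its mesh argument are hypothetical):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP


def apply_ddp(model: torch.nn.Module, dp_mesh) -> torch.nn.Module:
    """Hypothetical helper: replicate the model over the data-parallel mesh.

    Only valid when TP and HSDP are disabled, matching the guard above.
    """
    # A 1-D DeviceMesh exposes its process group via get_group().
    return DDP(model, process_group=dp_mesh.get_group())
```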
Stack from ghstack (oldest at bottom):