
[Train] Add example of pre-training Llama model on Intel Gaudi #45459

Open
wants to merge 3 commits into base: master
Conversation

@harborn (Contributor) commented May 21, 2024

Why are these changes needed?

To leverage the potential of the Intel Gaudi accelerator, we extend Ray Train's capabilities by adding support for Intel Gaudi (HPU) hardware. This PR includes an example of pre-training Llama-7b on multiple HPUs. A minimal sketch of the trainer setup is shown below.
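For orientation, here is a minimal sketch of what the trainer configuration looks like for Gaudi (illustrative only: `train_func` and the worker count are placeholders, and the notebook in this PR sets everything up in more detail):

```python
# Minimal sketch of a Ray Train TorchTrainer configured for Intel Gaudi (HPU).
# train_func is a placeholder for the per-worker pre-training loop.
from ray.train import ScalingConfig
from ray.train.torch import TorchConfig, TorchTrainer


def train_func():
    ...  # per-worker pre-training loop (DeepSpeed ZeRO-3 in this example)


trainer = TorchTrainer(
    train_func,
    torch_config=TorchConfig(backend="hccl"),  # HCCL is Gaudi's collective backend
    scaling_config=ScalingConfig(
        num_workers=8,                    # placeholder worker count
        resources_per_worker={"HPU": 1},  # one Gaudi card per worker
        use_gpu=False,
    ),
)
result = trainer.fit()
```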

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@aslonnie (Collaborator) left a comment

(for train team folks to review)

@anyscalesam added the triage (Needs triage: priority, bug/not-bug, and owning component) and train (Ray Train Related Issue) labels on May 23, 2024
3 commits, each Signed-off-by: Wu, Gangsheng <gangsheng.wu@intel.com>
@woshiyyya self-assigned this May 31, 2024
@@ -0,0 +1,568 @@
{
Member left a comment:

High-level question: it seems this uses DeepSpeed ZeRO-3 for pre-training. Why do we also include Megatron here?

Contributor Author replied:

This example imports Megatron just for data processing.
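For readers who haven't used it, a minimal DeepSpeed ZeRO-3 configuration looks roughly like the sketch below (illustrative values; the notebook's actual config may differ). Stage 3 shards optimizer states, gradients, and parameters across the data-parallel workers:

```python
# Minimal ZeRO-3 config sketch (illustrative only).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        "stage": 3,                 # shard optimizer state, gradients, and parameters
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "bf16": {"enabled": True},      # bf16 is the usual precision on Gaudi
}
```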

" )\n",
"\n",
" # Data loader only on rank 0 of each model parallel group.\n",
" if args.use_dataset_only or mpu.get_tensor_model_parallel_rank() == 0:\n",
Member left a comment:

Where did we configure the tensor and pipeline parallel group size?
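For context, Megatron-style parallel groups are usually initialized along the lines of the sketch below (assuming `megatron.core`; the notebook may wire this up differently, and within Ray Train the `torch.distributed` process group is already created via `TorchConfig(backend="hccl")`):

```python
# Sketch only: with tp=1 and pp=1, every worker is rank 0 of its
# tensor-model-parallel group, so the data-loader branch above runs on all workers.
from megatron.core import parallel_state

# Assumes torch.distributed is already initialized.
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=1,    # placeholder tp size
    pipeline_model_parallel_size=1,  # placeholder pp size
)
rank = parallel_state.get_tensor_model_parallel_rank()
```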

"(RayTrainWorker pid=339380) {'loss': nan, 'grad_norm': nan, 'learning_rate': 4.9e-05, 'epoch': 0.0, 'memory_allocated (GB)': 40.42, 'max_memory_allocated (GB)': 93.68, 'total_memory_available (GB)': 94.62}\n",
"(RayTrainWorker pid=339380) {'loss': nan, 'grad_norm': nan, 'learning_rate': 4.875e-05, 'epoch': 0.0, 'memory_allocated (GB)': 40.4, 'max_memory_allocated (GB)': 93.68, 'total_memory_available (GB)': 94.62}\n",
"(RayTrainWorker pid=339380) {'loss': nan, 'grad_norm': nan, 'learning_rate': 4.85e-05, 'epoch': 0.0, 'memory_allocated (GB)': 40.4, 'max_memory_allocated (GB)': 93.68, 'total_memory_available (GB)': 94.62}\n",
"(RayTrainWorker pid=339380) {'loss': nan, 'grad_norm': nan, 'learning_rate': 4.825e-05, 'epoch': 0.0, 'memory_allocated (GB)': 40.45, 'max_memory_allocated (GB)': 93.68, 'total_memory_available (GB)': 94.62}\n",
@woshiyyya (Member) commented May 31, 2024:

It seems the loss and grad norm are NaN; can you try to fix this bug?

"\n",
" # Set backend to hccl in TorchConfig\n",
" torch_config = TorchConfig(backend=\"hccl\")\n",
" runtime_env = {\n",
Member left a comment:

Do we need this if it's empty?

"cell_type": "markdown",
"metadata": {},
"source": [
"## Process dataset to dataloader"
@woshiyyya (Member) commented May 31, 2024:

In general, can we introduce the model sharding layout so that users can better understand why the dataloader is defined this way?

Including the pp/tp/dp group sizes, global batch size, per-device batch size, etc.
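For reference, these quantities usually relate as in the short sketch below (placeholder numbers, not the notebook's actual values):

```python
# Placeholder numbers for illustration only.
num_workers = 8                  # total Ray Train workers (HPUs)
tp, pp = 1, 1                    # tensor / pipeline parallel group sizes
dp = num_workers // (tp * pp)    # data-parallel group size
per_device_batch = 1             # micro-batch per HPU
grad_accum_steps = 1
global_batch = per_device_batch * grad_accum_steps * dp
print(f"dp={dp}, global batch size={global_batch}")
```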

Labels
train (Ray Train Related Issue), triage (Needs triage: priority, bug/not-bug, and owning component)
Projects: None yet
Development: merging this pull request will not close any linked issues.

4 participants