Cannot run on v4-16 worker 0 TPU VM: "Failed to get global TPU topology" #25
Hi @markusheimerl, thanks for the issue! It looks like you are setting […]. It looks to me like this is a Torch XLA issue. It is possible that this can be fixed by using a newer version of the base container here. If not, maybe we need to file an issue with Torch XLA. What I would recommend first is trying […].
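The specific container the maintainer pointed to did not survive extraction. Purely as an illustration, the published PyTorch/XLA TPU images can be pulled like this (the registry path and tag below are assumptions, not the image referenced in the thread):

```
# Assumption: one of the public PyTorch/XLA TPU release images; the exact
# image/tag the maintainer linked is not preserved here.
docker pull us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.1_3.10_tpuvm
```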
Hi @markusheimerl, it seems you are using a v4-16 TPU, which has 2 host VMs. This multi-host setup is currently not supported. To test on TPU, I suggest running on a v4-8 / v5e-8, which is a single-host TPU architecture with 1 VM. You should be able to run the command on v4-8 / v5e-8 out of the box.
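For reference, a single-host v4-8 TPU VM can be created along these lines (the zone, VM name, and runtime version below are illustrative assumptions; adjust to your project):

```
# Create a single-host TPU VM (1 host); all values here are examples.
gcloud compute tpus tpu-vm create my-v4-8 \
  --zone=us-central2-b \
  --accelerator-type=v4-8 \
  --version=tpu-ubuntu2204-base
```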
Hi @michaelmoynihan, I also have the same "Failed to get global TPU topology" error.
So I made a Dockerfile with these contents:
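The original Dockerfile contents were lost from the thread. A minimal sketch of what it might have contained, assuming a public PyTorch/XLA TPU base image (the image tag and pip packages are assumptions):

```
# Hypothetical reconstruction -- the user's actual Dockerfile is not preserved.
FROM us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.1_3.10_tpuvm
RUN pip install transformers accelerate
WORKDIR /workspace
```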
Then I ran:

```
sudo docker build -t my-tpu-pytorch-image .
sudo docker run -v /home/me/finetune:/workspace my-tpu-pytorch-image python /workspace/train.py
```

where train.py is this script for training Gemma-7B: https://huggingface.co/google/gemma-7b/blob/main/examples/example_fsdp.py
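One thing worth checking: for a container to see the TPU at all, Docker generally needs privileged mode and host networking. This flag combination comes from general PyTorch/XLA Docker guidance, not from this thread, so treat the variant below as a sketch:

```
# Privileged mode + host networking so the container can access the TPU chips.
sudo docker run --privileged --net=host \
  -v /home/me/finetune:/workspace \
  my-tpu-pytorch-image python /workspace/train.py
```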
The same error appears again.
I tried to run this script on a TPU v3-8 and, with slight modifications (I dropped down to Gemma-2B because of a RESOURCE_EXHAUSTED error), could start it with the following command (without Docker): […]
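The exact command was lost in extraction. A plausible sketch, assuming the environment variables commonly used with PyTorch/XLA on TPU VMs (both variable names are standard for Torch XLA, but their use here is my assumption):

```
# PJRT_DEVICE selects the TPU runtime; XLA_USE_SPMD enables SPMD mode,
# which the Gemma FSDP example relies on. Assumed, not from the thread.
PJRT_DEVICE=TPU XLA_USE_SPMD=1 python train.py
```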
The script is working now. It looks like I was using the wrong VM version when creating the TPU, and I had forgotten to set the environment variables […] and to add this env var: […]
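The variable itself did not survive extraction; for Torch XLA on TPU VMs the usual candidate is PJRT_DEVICE, which is an assumption on my part here:

```
# Assumption: the standard Torch XLA runtime selector, not confirmed by the thread.
export PJRT_DEVICE=TPU
```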