Help needed to enable TPU Training #630

Closed
ZDisket opened this issue Jul 26, 2021 · 11 comments
Assignees: dathudeptrai
Labels: bug 🐛 Something isn't working, wontfix

Comments

@ZDisket
Collaborator

ZDisket commented Jul 26, 2021

I've been trying to get TensorFlowTTS to train on Cloud TPUs because they're really fast and easy to access through the TRC, starting with MB-MelGAN + HiFi-GAN discriminator. I've already implemented all the changes required here, including overhauling the dataloaders to use TFRecords and Google Cloud Storage. When I try to train, however, I get this cryptic error in both TF 2.5.0 and nightly (I didn't use TF 2.3.1 because it wrongly allocates something to the CPU, causing a different error).

         [[cond_1]]
         [[TPUReplicate/_compile/_10135486412832257275/_4]]
         [[TPUReplicate/_compile/_10135486412832257275/_4/_76]]
  (4) Invalid argument: {{function_node __inference__one_step_forward_179257}} Output shapes of then and else branches do not match: (f32[64,<=8192], f32[64,<=8192]) vs. (f32[64,<=8192], f32[0])

[64, <=8192] corresponds to [batch_size, batch_max_steps].
Here's the full training log:
train_log.txt
I can't figure out what causes this issue, no matter what I try. Any ideas? Being able to train on TPUs would be really beneficial and within reach. I can provide specific instructions to replicate the issue, but it requires a Google Cloud project with storage even when using a Colab TPU (TensorFlow 2.x refuses to save and load data from the local filesystem when using a TPU). The same code, including the TFRecord dataloader, trains fine on GPU.
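
For context, this class of XLA error can be reproduced with a tiny tf.cond whose branches return differently shaped tensors. The sketch below is purely illustrative and not code from this repo; it just shows the pattern the compiler rejects:

    import tensorflow as tf

    # Illustrative sketch only: XLA (and therefore TPU compilation) requires both
    # branches of tf.cond to produce identically shaped outputs. `y_hat` stands in
    # for generated audio of shape [batch_size, batch_max_steps], i.e. [64, 8192].

    @tf.function(jit_compile=True)  # jit_compile forces XLA, similar to TPU compilation
    def one_step(y_hat, step, start_step):
        return tf.cond(
            step >= start_step,
            lambda: y_hat,                      # then-branch: f32[64, 8192]
            lambda: tf.zeros([0], tf.float32),  # else-branch: f32[0] -> shape mismatch
        )

    # This should fail to compile with a shape-mismatch error like the one above;
    # returning tf.zeros_like(y_hat) in the else-branch instead would compile.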

@dathudeptrai dathudeptrai self-assigned this Jul 26, 2021
@dathudeptrai dathudeptrai added the bug 🐛 Something isn't working label Jul 26, 2021
@dathudeptrai
Collaborator

dathudeptrai commented Jul 26, 2021

@ZDisket so the bug comes from tf.data?

@ZDisket
Collaborator Author

ZDisket commented Jul 26, 2021

@dathudeptrai

so the bug comes from tf.data?

Late reply because I posted the issue and then went to sleep. I iterated over the dataset with this function:

    @tf.function
    def iter_dts(self):
        # pull batches from the distributed dataset to check that tf.data itself works on TPU
        dist_dataset_iterator = iter(self.train_data_loader)
        step_num = 250
        for step in range(step_num):
            batch = next(dist_dataset_iterator)  # fetch one distributed batch
            tf.print(step)

and ran it just before self.run() in the GanBasedTrainer. Nothing happened, so it looks like it's something in the training step itself. According to the one issue where this problem is mentioned, the reporter calls it "a horrible bug deep down in the XLA compiler", but it should have been fixed after TF 2.2.

@ZDisket
Collaborator Author

ZDisket commented Jul 26, 2021

Also, I am getting a separate error when training Tacotron2. This one might be easier to solve.

  (0) Invalid argument: {{function_node __inference__one_step_forward_40494}} Input 1 to node `tacotron2/input_sequence_masks/Range` with op Range must be a compile-time constant.

XLA compilation requires that operator arguments that represent shapes or dimensions be evaluated to concrete values at compile time. This error means that a shape or dimension argument could not be evaluated at compile time, usually because the value of the argument depends on a parameter to the computation, on a variable, or on a stateful operation such as a random number generator.

         [[{{node tacotron2/input_sequence_masks/Range}}]]
         [[TPUReplicate/_compile/_15180876221625178187/_2]]
         [[cluster__one_step_forward/control_after/_1/_103]]
  (1) Invalid argument: {{function_node __inference__one_step_forward_40494}} Input 1 to node `tacotron2/input_sequence_masks/Range` with op Range must be a compile-time constant.

taclog.txt
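
For what it's worth, the input_sequence_masks/Range node is most likely tf.sequence_mask being called with a data-dependent maxlen (e.g. tf.reduce_max(input_lengths)), which XLA can't turn into a compile-time constant. A rough sketch of the failure mode and the usual workaround (a fixed, Python-constant maxlen); the function names here are hypothetical, not from tacotron2.py:

    import tensorflow as tf

    @tf.function(jit_compile=True)
    def masks_dynamic(input_lengths):
        # maxlen depends on the data -> the internal tf.range(maxlen) is not a
        # compile-time constant, so XLA/TPU rejects it with the error above
        maxlen = tf.reduce_max(input_lengths)
        return tf.sequence_mask(input_lengths, maxlen)

    @tf.function(jit_compile=True)
    def masks_static(input_lengths, maxlen=200):
        # a fixed Python-int maxlen (i.e. padding every batch to the same length)
        # keeps the shape static, so this version compiles
        return tf.sequence_mask(input_lengths, maxlen)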

@ZDisket
Collaborator Author

ZDisket commented Jul 28, 2021

@dathudeptrai I previously tried removing the collater and padding to the longest audio length, and it still failed.
Just to make sure, I tried again with the only change being removing that if statement, and it's the same. Perhaps the Tacotron2 issue is more worth looking into?
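
In case it helps, one way to guarantee fully static shapes on TPU is to pad every batch to the same fixed length with padded_batch, rather than to the longest item in each batch. A rough sketch, not the repo's actual collater; the field names, hop size, and lengths are assumptions for illustration:

    import tensorflow as tf

    BATCH_SIZE = 64
    BATCH_MAX_STEPS = 8192  # matches the [64, <=8192] shapes in the error above
    HOP_SIZE = 256          # assumed hop size for the mel frames

    def make_static_batches(dataset):
        # dataset is assumed to yield dicts like {"audios": [T], "mels": [T // HOP_SIZE, 80]}
        return dataset.padded_batch(
            BATCH_SIZE,
            padded_shapes={
                "audios": [BATCH_MAX_STEPS],
                "mels": [BATCH_MAX_STEPS // HOP_SIZE, 80],
            },
            drop_remainder=True,  # TPU also needs a constant batch size
        )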

@dathudeptrai
Collaborator

@ZDisket I will pin the issue. I have no idea about TPUs since I've never used one :D

@dathudeptrai dathudeptrai pinned this issue Jul 29, 2021
@ZDisket
Collaborator Author

ZDisket commented Jul 30, 2021

It seems that the people over at TensorFlowASR already have TPU support and ran into similar problems in the past - might be worth looking into: TensorSpeech/TensorFlowASR#100

@ZDisket
Collaborator Author

ZDisket commented Aug 11, 2021

I got Tacotron2 to start training on a Colab TPU by decorating BasedTrainer.run with @tf.function and removing experimental_relax_shapes from tacotron2.inference, at 1.4 s/it on a TPU v2-8 (about 4x faster than a Tesla V100). Right now it can't save intermediate results, because .numpy() is not supported in graph mode, nor checkpoints (except when training is forcibly interrupted). Also, TensorBoard creates log files but writes nothing to them, leaving empty 40-byte files.

However, on a small dataset with about 2.6k training elements it mysteriously gets stopped at 275 steps by something sending a Ctrl+C, at least in Colab (I think it's just the Colab instance running out of memory):
[train]: 0% 275/200000 [07:00<77:58:43, 1.41s/it]^C

From what I've read, the TensorFlowASR folks fixed a lot of issues by dropping custom training loops and using Keras' built-in .fit() instead.
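
If it helps, here's a rough sketch of what that .fit()-based route looks like with TPUStrategy. The model constructor and GCS paths are hypothetical placeholders, and this isn't how the current trainers work:

    import tensorflow as tf

    # Connect to the TPU and build a distribution strategy.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)

    with strategy.scope():
        model = build_tacotron2()  # hypothetical: a keras.Model with train_step overridden
        model.compile(optimizer=tf.keras.optimizers.Adam(1e-4))

    # Keras then owns distribution, checkpointing and TensorBoard, which sidesteps the
    # .numpy()-in-graph-mode and empty-logfile problems of a custom loop.
    model.fit(
        train_dataset,  # an already-batched tf.data.Dataset (e.g. TFRecords on GCS)
        epochs=10,
        callbacks=[
            tf.keras.callbacks.ModelCheckpoint("gs://some-bucket/ckpt-{epoch}"),
            tf.keras.callbacks.TensorBoard("gs://some-bucket/logs"),
        ],
    )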

@stale

stale bot commented Oct 12, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the wontfix label Oct 12, 2021
@stale stale bot closed this as completed Oct 19, 2021