
How to reproduce the same training process when using "train_from" #2006

Open
lemon234071 opened this issue Feb 2, 2021 · 4 comments

@lemon234071

lemon234071 commented Feb 2, 2021

Hello,

When model training is forced to stop by an accident, I use the option "train_from" to continue training from the checkpoint. But the result is different from training from start to finish without stopping:

  1. The patience counter for "early stop" is not saved into the checkpoint.
  2. The order of data batches provided by train_iter is different when resuming with train_from: the iterator starts over from the beginning of the dataset, so the data is very different from where it stood at the step of the saved checkpoint.

Note that I fixed all random seeds.

It would be very convenient if a reproduction mechanism could be added to the code base.
Any help will be greatly appreciated.
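
For illustration, here is a minimal sketch of what a checkpoint could carry so that both the early-stopping state and the RNG sequence survive a restart. This is not OpenNMT-py's actual checkpoint code; the `early_stopper` object with `patience_left` and `best_score` attributes is hypothetical:

```python
import random
import torch

def save_checkpoint(path, model, optimizer, step, early_stopper):
    torch.save({
        "model": model.state_dict(),
        "optim": optimizer.state_dict(),
        "step": step,
        # hypothetical early-stopping state: remaining patience and the
        # best validation score seen so far
        "early_stopping": {
            "patience_left": early_stopper.patience_left,
            "best_score": early_stopper.best_score,
        },
        # RNG states, so shuffling/sampling continues identically on resume
        "rng": {
            "python": random.getstate(),
            "torch": torch.get_rng_state(),
            "cuda": torch.cuda.get_rng_state_all() if torch.cuda.is_available() else None,
        },
    }, path)

def load_checkpoint(path, model, optimizer, early_stopper):
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optim"])
    early_stopper.patience_left = ckpt["early_stopping"]["patience_left"]
    early_stopper.best_score = ckpt["early_stopping"]["best_score"]
    random.setstate(ckpt["rng"]["python"])
    torch.set_rng_state(ckpt["rng"]["torch"])
    if ckpt["rng"]["cuda"] is not None:
        torch.cuda.set_rng_state_all(ckpt["rng"]["cuda"])
    return ckpt["step"]
```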

@francoishernandez
Member

This is roughly what I intended with #1826, but it's not compatible with all the changes we introduced in 2.0.
It should be possible to introduce such a mechanism, though: it would store some counter to keep track of where we are in each dataset. It would never be perfect, however, as there is quite a gap between when the data is read and when it is actually seen in a training batch, because of the pooling mechanism.
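
As a rough sketch of that counter idea, assuming a simple one-example-per-line corpus (the class name and attributes here are made up; the real iterators do pooling/bucketing on top of this, so the counter would mark where reading resumed rather than an exact batch boundary):

```python
class ResumableCorpusReader:
    """Yields one example per line and counts how many have been consumed,
    so the count can be stored in a checkpoint and used to fast-forward."""

    def __init__(self, path, start_index=0):
        self.path = path
        self.index = start_index  # examples already consumed before resuming

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for i, line in enumerate(f):
                if i < self.index:
                    continue  # skip examples seen before the checkpoint
                self.index = i + 1  # persist this counter in the checkpoint
                yield line.rstrip("\n")
```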

@lemon234071
Author

Thanks for your efforts.

@vince62s
Member

Regarding this issue, I implemented the following:
new option -dryrun_steps xxxxx
which batches during xxxxx steps without actually training, then starts training at xxxxx+1.
That restarts the training at the exact point in the data where it stopped.
The only issue is that it is very, very slow to reach xxxxx+1.
Is there a better idea, other than storing the index in each dataset?
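
In pseudocode terms, the idea looks roughly like this (`train_iter`, `train_step`, and the parameter names are stand-ins for the real training loop, not the actual implementation); it also shows why it is slow: all prior data is still read and batched:

```python
# Hypothetical sketch of the -dryrun_steps idea: consume (and discard)
# batches for the first dryrun_steps steps so the data iterator reaches
# the position it had at the checkpoint, then train normally.
def resume_training(train_iter, train_step, dryrun_steps, total_steps):
    for step, batch in enumerate(train_iter, start=1):
        if step <= dryrun_steps:
            continue  # fast-forward: data is read and batched, but no update
        train_step(batch)  # from step dryrun_steps + 1 onward, train for real
        if step >= total_steps:
            break
```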

@PC91
Contributor

PC91 commented May 4, 2024

I made an attempt at this feature in PR #2520. The idea is to skip to the saved text line in each corpus when training is resumed.
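
In spirit (this is my own hedged sketch, not code from the PR; `saved_offsets` is a hypothetical {corpus_path: line_number} mapping stored in the checkpoint), resuming would fast-forward each corpus file to its saved line before rebuilding batches:

```python
# Hedged sketch: reopen each corpus and skip the lines that were already
# consumed before the checkpoint, so reading resumes where it left off.
def reopen_corpora(saved_offsets):
    readers = {}
    for path, line_no in saved_offsets.items():
        f = open(path, encoding="utf-8")
        for _ in range(line_no):
            next(f, None)  # consume lines seen before the checkpoint
        readers[path] = f
    return readers
```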
