
How to reproduce the same training process when using "train_from" #2006

Open
lemon234071 opened this issue Feb 2, 2021 · 4 comments

@lemon234071

lemon234071 commented Feb 2, 2021

Hello,

When model training is forced to stop by an accident, I use the option "train_from" to continue training from the checkpoint. But the result is different from training from start to finish without stopping:

  1. The patience counter for "early stop" is not saved into the checkpoint.
  2. The order of data batches provided by train_iter is different when resuming with train_from: the iterator starts over from the beginning of the dataset, so the data is very different from where it stood at the step of the saved checkpoint.

Note that I fixed all random seeds.

It would be very convenient if a reproduction mechanism could be added to the code base.
Any help will be greatly appreciated.
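
For illustration, here is a minimal sketch of what a checkpoint could carry so that both the early-stopping state and the RNG sequence survive a restart. This is not OpenNMT-py's actual checkpoint code; the `early_stopper` object with `patience_left` and `best_score` attributes is hypothetical:

```python
import random
import torch

def save_checkpoint(path, model, optimizer, step, early_stopper):
    torch.save({
        "model": model.state_dict(),
        "optim": optimizer.state_dict(),
        "step": step,
        # hypothetical early-stopping state: remaining patience and the
        # best validation score seen so far
        "early_stopping": {
            "patience_left": early_stopper.patience_left,
            "best_score": early_stopper.best_score,
        },
        # RNG states, so shuffling/sampling continues identically on resume
        "rng": {
            "python": random.getstate(),
            "torch": torch.get_rng_state(),
            "cuda": torch.cuda.get_rng_state_all() if torch.cuda.is_available() else None,
        },
    }, path)

def load_checkpoint(path, model, optimizer, early_stopper):
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optim"])
    early_stopper.patience_left = ckpt["early_stopping"]["patience_left"]
    early_stopper.best_score = ckpt["early_stopping"]["best_score"]
    random.setstate(ckpt["rng"]["python"])
    torch.set_rng_state(ckpt["rng"]["torch"])
    if ckpt["rng"]["cuda"] is not None:
        torch.cuda.set_rng_state_all(ckpt["rng"]["cuda"])
    return ckpt["step"]
```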

@francoishernandez
Member

This is roughly what I intended with #1826, but it's not compatible with all the changes we introduced in 2.0.
It should be possible to introduce such a mechanism, though: it would store some counter to keep track of where we are in each dataset. It would never be perfect, however, as there is quite a gap between when the data is read and when it is actually seen in a training batch, because of the pooling mechanism.
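
As a rough sketch of that counter idea, assuming a simple one-example-per-line corpus (the class name and attributes here are made up; the real iterators do pooling/bucketing on top of this, so the counter would mark where reading resumed rather than an exact batch boundary):

```python
class ResumableCorpusReader:
    """Yields one example per line and counts how many have been consumed,
    so the count can be stored in a checkpoint and used to fast-forward."""

    def __init__(self, path, start_index=0):
        self.path = path
        self.index = start_index  # examples already consumed before resuming

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for i, line in enumerate(f):
                if i < self.index:
                    continue  # skip examples seen before the checkpoint
                self.index = i + 1  # persist this counter in the checkpoint
                yield line.rstrip("\n")
```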

@lemon234071
Author

Thanks for your efforts.

@vince62s
Member

Regarding this issue, I implemented the following:
new option -dryrun_steps xxxxx
which batches during xxxxx steps without actually training, then starts training at xxxxx+1.
That restarts the training at the exact point in the data where it stopped.
The only issue is that it is very, very slow to reach xxxxx+1.
Is there a better idea, other than storing the index in each dataset?
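
In pseudocode terms, the idea looks roughly like this (`train_iter`, `train_step`, and the parameter names are stand-ins for the real training loop, not the actual implementation); it also shows why it is slow: all prior data is still read and batched:

```python
# Hypothetical sketch of the -dryrun_steps idea: consume (and discard)
# batches for the first dryrun_steps steps so the data iterator reaches
# the position it had at the checkpoint, then train normally.
def resume_training(train_iter, train_step, dryrun_steps, total_steps):
    for step, batch in enumerate(train_iter, start=1):
        if step <= dryrun_steps:
            continue  # fast-forward: data is read and batched, but no update
        train_step(batch)  # from step dryrun_steps + 1 onward, train for real
        if step >= total_steps:
            break
```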

@PC91
Contributor

PC91 commented May 4, 2024

I made an attempt at this feature in PR #2520. The idea is to skip to the saved text line in each corpus when training is resumed.
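
In spirit (this is my own hedged sketch, not code from the PR; `saved_offsets` is a hypothetical {corpus_path: line_number} mapping stored in the checkpoint), resuming would fast-forward each corpus file to its saved line before rebuilding batches:

```python
# Hedged sketch: reopen each corpus and skip the lines that were already
# consumed before the checkpoint, so reading resumes where it left off.
def reopen_corpora(saved_offsets):
    readers = {}
    for path, line_no in saved_offsets.items():
        f = open(path, encoding="utf-8")
        for _ in range(line_no):
            next(f, None)  # consume lines seen before the checkpoint
        readers[path] = f
    return readers
```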
