Fix error to load data at the correct position when resuming from a checkpoint #2520

PC91 · 2023-11-19T19:31:55Z

This PR adds a mechanism to resume training from the saved positions in the corpora. The idea is to keep a cursor for each corpus and save its text line number (the batch variable cid_line_number) in the checkpoint file.

The following features are implemented:

Add a new parameter resume_from_corpora: when True, training tries to resume from the last saved text line of each corpus. Otherwise, training resumes from the beginning of all corpora.
Update the calculation of cid_line_number so that the text line number comes directly from the exfile_open function.
Conditions to resume training from the saved text lines:
- The last text lines of all corpora must be saved in the checkpoint (for backward compatibility with existing versions).
- All corpus names in the config and in the saved checkpoint must match.
- Quick sanity check: for each corpus in the config, its saved text line cannot exceed its total number of lines.
Communication between the trainer and the model saver to handle corpus cursors.
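
The resume conditions above can be sketched roughly as follows. This is only an illustration and not the code in this PR: the function name resolve_start_lines, its arguments, and the dictionary layout are hypothetical, and a start line of 0 is assumed to mean "from the beginning".

```python
from typing import Dict, Optional


def resolve_start_lines(
    resume_from_corpora: bool,
    corpus_sizes: Dict[str, int],            # hypothetical: corpus name -> total number of lines
    saved_lines: Optional[Dict[str, int]],   # hypothetical: corpus name -> last saved text line
) -> Dict[str, int]:
    """Return the text line each corpus should start from."""
    from_beginning = {name: 0 for name in corpus_sizes}

    # Option disabled: always restart from the beginning of all corpora.
    if not resume_from_corpora:
        return from_beginning
    # Older checkpoints carry no saved text lines (backward compatibility).
    if saved_lines is None:
        return from_beginning
    # Corpus names in the config and in the checkpoint must match.
    if set(saved_lines) != set(corpus_sizes):
        return from_beginning
    # Sanity check: a saved line cannot exceed the size of its corpus.
    if any(saved_lines[name] > total for name, total in corpus_sizes.items()):
        return from_beginning
    return dict(saved_lines)
```

If any check fails, every corpus falls back to the beginning, which matches the tested scenarios listed below.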

The following scenarios are tested:

Backward compatibility test: resume from the beginning when using a checkpoint from an existing version (with no saved text lines).
Resume from a checkpoint with saved text lines:
- When resume_from_corpora=True
  - Some corpora in the config do not match (resume from the beginning).
  - Some saved text lines exceed the total number of text lines (resume from the beginning).
  - All checks pass (resume from the saved text lines).
- When resume_from_corpora=False (resume from the beginning).

vince62s · 2023-11-20T07:16:21Z

This is doing the same thing as what is described here: #2006 (comment)
The issue is that if the checkpoint is at 250 000 steps and you want to continue, it takes way too long to iterate over those batches. This is why memorizing the index of each dataset and setting the cursor at this index is more efficient.
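
A toy sketch of the difference, assuming an in-memory list of lines stands in for a corpus. The helper batches_from_line is hypothetical and not part of OpenNMT-py. Starting at a saved line yields the same remaining batches as replaying from the beginning and discarding the ones already consumed, without building those earlier batches.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batches_from_line(
    lines: Iterable[str], start_line: int, batch_size: int
) -> Iterator[List[str]]:
    """Yield batches of lines, starting at a saved cursor position."""
    # Skip straight to the saved line instead of rebuilding earlier batches.
    it = islice(lines, start_line, None)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


corpus = [f"line {i}" for i in range(10)]

# Replay: build every batch from the start, then drop the first three.
replayed = list(batches_from_line(corpus, 0, 2))[3:]
# Cursor: start directly at line 6 (3 batches of 2 lines already consumed).
resumed = list(batches_from_line(corpus, 6, 2))
```

Both lists hold the same batches. In real training, the replay path also repeats tokenization and batching work for every skipped step, which is where the cost at 250 000 steps comes from.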

PC91 · 2024-03-31T20:45:13Z

> This is doing the same thing as what is described here: #2006 (comment) The issue is that if the checkpoint is at 250 000 steps and you want to continue, it takes way too long to iterate over those batches. This is why memorizing the index of each dataset and setting the cursor at this index is more efficient.

Thanks @vince62s! The code is updated. Could you have a look and merge it into the main code base?

PC91 force-pushed the datagen-from-checkpoint branch 2 times, most recently from cd4d637 to 3c4b7b7 Compare November 19, 2023 20:09

PC91 marked this pull request as draft January 7, 2024 02:27

PC91 force-pushed the datagen-from-checkpoint branch from 3c4b7b7 to 0a06542 Compare January 7, 2024 02:28

PC91 force-pushed the datagen-from-checkpoint branch from 0a06542 to 874efcc Compare March 14, 2024 21:20

PC91 force-pushed the datagen-from-checkpoint branch 18 times, most recently from d191392 to e2093e2 Compare March 31, 2024 20:25

Load data at the correct position when resuming from a checkpoint

b422cfc

PC91 force-pushed the datagen-from-checkpoint branch from e2093e2 to b422cfc Compare March 31, 2024 20:34

PC91 marked this pull request as ready for review March 31, 2024 20:44

PC91 mentioned this pull request May 4, 2024

How to reproduce the same training process when using "train_from" #2006

Open
