The more I pretrain (SSL), the worse the fine-tuned model gets? #9175
Unanswered
riqiang-dp asked this question in Q&A
Replies: 4 comments 8 replies
-
Hello, could you try lowering your learning rate or using another optimizer like AdamW?
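To see why a lower learning rate interacts with AdamW, here is a minimal single-parameter sketch of the AdamW update rule (illustrative only; real training would use `torch.optim.AdamW`, and the hyperparameter values below are placeholders, not from this thread):

```python
import math

def adamw_step(p, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter. Note the decoupled
    weight decay: it is scaled by lr directly, so lowering lr also
    softens the decay, not just the gradient step."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment EMA
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * p)
    return p, m, v

# A smaller lr shrinks every step, which can keep a fine-tuning run
# from drifting away from a late-stage SSL checkpoint too quickly.
p, m, v = 1.0, 0.0, 0.0
for t in range(1, 11):
    p, m, v = adamw_step(p, grad=0.5, m=m, v=v, t=t)
```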
3 replies
-
@nithinraok can you comment?
2 replies
-
Another problem I ran into is that, with the same data sampling and settings, the training phase runs fine but validation fails at the contrastive loss calculation:
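The traceback did not survive the page scrape, but one plausible cause (an assumption, not confirmed by the thread) is that short validation utterances yield fewer candidate frames than the configured number of negatives. A minimal sketch of that failure mode, using a hypothetical `sample_negatives` helper rather than NeMo's actual implementation:

```python
import random

def sample_negatives(frames, num_negatives):
    """Sketch of negative sampling in a wav2vec-style contrastive
    loss. `frames` are candidate frame indices (hypothetical helper)."""
    if len(frames) <= num_negatives:
        # random.sample raises ValueError when the population is
        # smaller than the request -- the kind of crash that can
        # surface only on short validation utterances.
        raise ValueError(
            f"need {num_negatives} negatives but only "
            f"{len(frames)} candidate frames are available"
        )
    return random.sample(frames, num_negatives)

# Training batch: plenty of frames, sampling succeeds.
train_negs = sample_negatives(list(range(200)), num_negatives=100)

# Validation batch: a short utterance can have too few frames.
failed = False
try:
    sample_negatives(list(range(40)), num_negatives=100)
except ValueError:
    failed = True
```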
1 reply
-
Hi, I'm trying out SSL pretraining for ASR. I have about 50k hours of unlabeled data and 2k hours of transcribed data. With 100k hours of data, first of all, I couldn't get the default dataloader to work because the CPU runs out of memory within one epoch. I then trained with only the contrastive loss, but after about 50k steps the SSL contrastive loss starts to stagnate.
I took two checkpoints, one from around 40k steps and another from the end of 80k steps, and fine-tuned each with the labeled ASR data. The checkpoint from 80k steps converges much more slowly.
It seems that the more I pretrain, the worse the pretrained model I get. Is this expected? What could I be doing wrong? Furthermore, this is not an isolated case: I've tried other combinations of parameters and models (e.g. Conformer, Fast Conformer) and a smaller dataset for pretraining, and with less data the pretrained model also seems to be better.
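Regarding the CPU memory issue: one generic workaround is to stream the manifest in small batches instead of materializing everything up front. The sketch below shows only the streaming idea with a stand-in file; it is not NeMo's tarred/streaming dataset API, and all names and sizes are illustrative:

```python
import tempfile

def iter_manifest(path, batch_size=8):
    """Yield small batches of manifest lines lazily, so the whole
    manifest is never held in CPU memory at once (generic sketch)."""
    batch = []
    with open(path) as f:
        for line in f:
            batch.append(line.rstrip("\n"))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # flush the final partial batch
        yield batch

# Demo with a tiny stand-in manifest (a real manifest would list
# tens of thousands of hours of audio entries).
with tempfile.NamedTemporaryFile("w", suffix=".json",
                                 delete=False) as f:
    for i in range(20):
        f.write(f'{{"audio_filepath": "utt_{i}.wav"}}\n')
    manifest_path = f.name

batches = list(iter_manifest(manifest_path, batch_size=8))
```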