
Layer-Wise Learning Rate #984

Answered by githubnemo
MohammadJavadD asked this question in Q&A

To implement layer-wise learning rates you can make use of the param group feature of PyTorch optimizers. In essence, you define parameter groups with optimizer-specific attributes (such as the learning rate). In plain PyTorch you would have to pass the actual parameter objects. Since skorch initializes the module lazily, the parameters are not known before initialization, so skorch provides its own version of param groups that matches on parameter names instead of the actual parameter objects.

A basic example from the docs:

from skorch import NeuralNet

net = NeuralNet(
    my_net,  # your torch.nn.Module (class or instance)
    optimizer__param_groups=[
        ('embedding.*', {'lr': 0.0}),  # freeze all parameters under "embedding"
        ('linear0.bias', {'lr': 1}),   # separate learning rate for the bias of "linear0"
    ],
)

Note that embedding and linear0 refer to names inside my_net; the patterns are matched against the module's parameter names, so adapt them to your own module.
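For context, here is a rough, self-contained sketch of how this could look end to end. Everything besides optimizer__param_groups is an assumption made for illustration: MyModule, its layer names, the random data, and the choice of NeuralNetClassifier are not from the discussion.

# Self-contained sketch: the module, data, and hyperparameters below are
# illustrative assumptions, not taken from the discussion.
import numpy as np
import torch
from torch import nn
from skorch import NeuralNetClassifier

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(100, 16)  # parameter: embedding.weight
        self.linear0 = nn.Linear(16, 2)         # parameters: linear0.weight, linear0.bias

    def forward(self, X):
        out = self.embedding(X).mean(dim=1)
        # NeuralNetClassifier's default criterion expects probabilities
        return torch.softmax(self.linear0(out), dim=-1)

net = NeuralNetClassifier(
    MyModule,
    max_epochs=1,
    verbose=0,
    optimizer__param_groups=[
        ('embedding.*', {'lr': 0.0}),   # matched against "embedding.weight"
        ('linear0.bias', {'lr': 1.0}),  # matched against "linear0.bias"
    ],
)

X = np.random.randint(0, 100, size=(64, 5))  # token indices for the embedding
y = np.random.randint(0, 2, size=64)
net.fit(X, y)

# Inspect the resulting optimizer groups: one per pattern, plus a group
# for the remaining, unmatched parameters with the default learning rate.
for group in net.optimizer_.param_groups:
    print(group['lr'], sum(p.numel() for p in group['params']))

Setting the learning rate of a group to 0.0 effectively freezes the matched parameters (at least for plain SGD without weight decay) while keeping them registered in the optimizer.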
