
[Core] support saving and loading of sharded checkpoints #7830

Open
sayakpaul wants to merge 60 commits into main
Conversation

sayakpaul (Member)

What does this PR do?

Follow-up of #6396.

This PR adds support for saving a big model's state dict as multiple shards for efficient portability and loading. It also adds support for loading those sharded checkpoints back.

This is much akin to how big models like T5-XXL are handled.

Also added a test to ensure that models with _no_split_modules specified can be sharded and loaded back to perform inference, with numerical assertions on the outputs.
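
For illustration, here is a minimal sketch of what such a round-trip check can look like (the actual test lives in tests/models/test_modeling_common.py; the helper below, its tiny shard size, and the .sample output attribute are illustrative assumptions):

import tempfile

import torch

def check_sharded_roundtrip(model, inputs, max_shard_size="1MB"):
    # A tiny max_shard_size forces multiple shards even for a small test model.
    model.eval()
    with torch.no_grad():
        expected = model(**inputs).sample

    with tempfile.TemporaryDirectory() as tmp_dir:
        model.save_pretrained(tmp_dir, max_shard_size=max_shard_size)
        reloaded = type(model).from_pretrained(tmp_dir).eval()

    with torch.no_grad():
        actual = reloaded(**inputs).sample

    # The reloaded sharded model should be numerically equivalent to the original.
    assert torch.allclose(expected, actual, atol=1e-5)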

Here's a real use-case. Consider this Transformer2DModel checkpoint: https://huggingface.co/sayakpaul/actual_bigger_transformer/.

It was serialized like so:

from diffusers import Transformer2DModel
from accelerate.utils import compute_module_sizes
from accelerate import init_empty_weights
import torch.nn as nn

def bytes_to_giga_bytes(bytes):
    return f"{(bytes / 1024 / 1024 / 1024):.3f}"

with init_empty_weights():
    # Instantiate on the meta device to measure sizes without allocating any memory.
    pixart_transformer = Transformer2DModel.from_config("PixArt-alpha/PixArt-XL-2-1024-MS", subfolder="transformer")
    bigger_transformer = Transformer2DModel.from_config(
        pixart_transformer.config, num_layers=72, num_attention_heads=36, cross_attention_dim=2592,
    )
    module_size = bytes_to_giga_bytes(compute_module_sizes(bigger_transformer)[""])
    print(f"{module_size=} GB")
    pytorch_total_params = sum(p.numel() for p in bigger_transformer.parameters()) / 1e9
    print(f"{pytorch_total_params=} B")

    # A plain stack of linear layers of a similar order of magnitude, for comparison.
    model = nn.Sequential(*[nn.Linear(8944, 8944) for _ in range(1000)])
    module_size = bytes_to_giga_bytes(compute_module_sizes(model)[""])
    print(f"{module_size=} GB")
    pytorch_total_params = sum(p.numel() for p in model.parameters()) / 1e9
    print(f"{pytorch_total_params=} B")

# Instantiate for real this time, and save in 10GB shards while pushing to the Hub.
actual_bigger_transformer = Transformer2DModel.from_config(
    pixart_transformer.config, num_layers=72, num_attention_heads=36, cross_attention_dim=2592
)
actual_bigger_transformer.save_pretrained("/raid/.cache/actual_bigger_transformer", max_shard_size="10GB", push_to_hub=True)
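
After the save, the repo contains the numbered shard files plus an index JSON that maps each parameter to its shard. A quick way to inspect the layout (the index filename below follows the transformers-style convention this PR is modeled on, and is an assumption here):

import json
import os

from huggingface_hub import snapshot_download

local_dir = snapshot_download("sayakpaul/actual_bigger_transformer")
print(sorted(os.listdir(local_dir)))  # numbered shard files plus an index JSON

# Assumed index filename; adjust to whatever the repo actually contains.
with open(os.path.join(local_dir, "diffusion_pytorch_model.safetensors.index.json")) as f:
    index = json.load(f)
print(index["metadata"]["total_size"])        # total state-dict size in bytes
print(list(index["weight_map"].items())[:3])  # parameter name -> shard file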

As we can see from the Hub repo, its state dict is sharded. To perform inference with the model, all we have to do is this:

from diffusers import Transformer2DModel
import torch

def get_inputs():
    sample = torch.randn(1, 4, 128, 128)
    timestep = torch.randint(0, 1000, size=(1,))
    encoder_hidden_states = torch.randn(1, 120, 4096)

    resolution = torch.tensor([1024, 1024]).repeat(1, 1)
    aspect_ratio = torch.tensor([1.0]).repeat(1, 1)
    added_cond_kwargs = {"resolution": resolution, "aspect_ratio": aspect_ratio}
    return sample, timestep, encoder_hidden_states, added_cond_kwargs

with torch.no_grad():
    # max_memory = {0: "15GB"}  # reasonable estimate for a consumer GPU.
    new_model = Transformer2DModel.from_pretrained(
        "sayakpaul/actual_bigger_transformer",
        device_map="auto",  # let accelerate place the shards across the available devices
    )

    sample, timestep, encoder_hidden_states, added_cond_kwargs = get_inputs()
    out = new_model(
        hidden_states=sample,
        encoder_hidden_states=encoder_hidden_states,
        timestep=timestep,
        added_cond_kwargs=added_cond_kwargs,
    ).sample
    print(f"{out.shape=}, {out.device=}")

I have purposefully not added documentation because all of this will become useful once we use it in the context of a full-fledged pipeline execution (up next) :)

sayakpaul requested review from yiyixuxu and SunMarc on May 1, 2024 10:46
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@sayakpaul (Member Author)

@yiyixuxu @SunMarc a gentle ping here.

@BenjaminBossan (Member) left a comment

Always delightful to deal with the from_pretrained code ;)

I don't really have any bigger comments, as this should hopefully work well since it's based on the transformers implementation. Only some smaller comments.

(Inline review threads on src/diffusers/models/modeling_utils.py, src/diffusers/utils/hub_utils.py, and tests/models/test_modeling_common.py.)
@SunMarc (Member) left a comment

Thanks for your work @sayakpaul! Left a suggestion (not a blocker, we can do it afterwards if needed)! No major comments since @BenjaminBossan did a very thorough review already!

(Inline review thread on src/diffusers/models/modeling_utils.py.)
sayakpaul requested a review from Wauplin on May 29, 2024 01:42
@Wauplin (Contributor) left a comment

Thanks a lot @sayakpaul for the integration and iterating over it! Current code looks good to me :) I'd rather have another pair of eyes reviewing it, given it's fairly easy to miss something when iterating/reviewing several times on the same code.

Thanks again!

(Inline review thread on src/diffusers/utils/hub_utils.py.)

sayakpaul commented May 29, 2024

I'd rather have another pair of eyes reviewing it, given it's fairly easy to miss something when iterating/reviewing several times on the same code.

Yeah. @yiyixuxu would be the final approver here :)

@yiyixuxu (Collaborator) left a comment

thanks for the PR!!
I left some comments and questions :)

@@ -349,7 +349,7 @@ def load_config(
         local_files_only = kwargs.pop("local_files_only", False)
         revision = kwargs.pop("revision", None)
         _ = kwargs.pop("mirror", None)
-        subfolder = kwargs.pop("subfolder", None)
+        subfolder = kwargs.pop("subfolder", None) or ""
@yiyixuxu (Collaborator) commented on this diff:

Why don't we handle it where it fails, then?

We would only need to change one place, no?
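
For context, the or "" coercion guards the downstream path handling against a None subfolder; a hypothetical illustration of the failure mode:

import os

subfolder = None
# os.path.join("repo_id", subfolder)  # raises TypeError on a None component
subfolder = subfolder or ""
print(os.path.join("repo_id", subfolder))  # "repo_id/" - an empty subfolder is effectively a no-op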

(Inline review threads on src/diffusers/models/modeling_utils.py and src/diffusers/utils/hub_utils.py.)

sayakpaul commented Jun 3, 2024

@yiyixuxu do the recent changes work for you?

(I have run the tests)
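
(For reference, something along these lines would run the relevant test module; the -k filter is a guess at the test names:)

python -m pytest tests/models/test_modeling_common.py -k "sharded"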

sayakpaul requested a review from yiyixuxu on June 3, 2024 12:35