
Torch 2.3 breaks DDP? #2527

Closed
TParcollet opened this issue Apr 26, 2024 · 7 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed), important

@TParcollet (Collaborator) commented Apr 26, 2024:

Describe the bug

Starting any recipe with two or more processes under DDP fails, as the run_on_main function does not hold back (no barrier?) the other processes.

From the PyTorch 2.3 documentation, we can read: "ProcessGroupNCCL now relies on stream synchronization instead of device synchronization to block the CPU. Thus, please do not assume that barrier() would perform a device synchronization." This is scary.
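
For context, the pattern the recipes rely on looks roughly like the minimal sketch below (`prepare_data` is a hypothetical stand-in for a recipe's data-preparation function, and the DDP process group is assumed to be initialized by the launcher):

    # Minimal sketch of the run_on_main pattern used by the recipes.
    # `prepare_data` is a hypothetical placeholder; the DDP process group is
    # assumed to already be initialized (e.g. when launched via torchrun).
    import speechbrain as sb
    from speechbrain.utils.distributed import run_on_main

    def prepare_data():
        print("running data preparation on the main process only")

    # Expected: rank 0 runs prepare_data while every other rank waits at a
    # barrier until it finishes; the report above is that with torch 2.3 the
    # other ranks are no longer held back.
    run_on_main(prepare_data)
    sb.utils.distributed.ddp_barrier()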

Expected behaviour

Well, the barrier should stop the other processes.

To Reproduce

No response

Environment Details

No response

Relevant Log Output

No response

Additional Context

No response

TParcollet added the bug label on Apr 26, 2024
TParcollet added the help wanted and important labels on Apr 26, 2024
@Adel-Moumen (Collaborator) commented:

I'm unable to reproduce your issue.

I tried on a two-node setup with the following example:

    import os
    import time

    import speechbrain as sb
    import speechbrain.utils.distributed  # noqa: F401  (makes sb.utils.distributed available)

    # The DDP process group is assumed to already be initialized (e.g. by the
    # recipe / launcher) before this snippet runs.
    if os.environ['RANK'] == '0':
        # sleep for a few seconds so the other ranks reach the barrier first
        print("Rank 0 is sleeping")
        time.sleep(5)
        print("Rank 0 finished sleeping")
        print("Rank 0 is ready for barrier")
        sb.utils.distributed.ddp_barrier()
        print("Rank 0 is done")
    else:
        print(f"Rank {os.environ['RANK']} is ready for barrier")
        sb.utils.distributed.ddp_barrier()
        print(f"Rank {os.environ['RANK']} is done")

And got the following output:

Rank 1 is ready for barrier
Rank 0 is sleeping
Rank 0 finished sleeping
Rank 0 is ready for barrier
Rank 0 is done
Rank 1 is done

I also tried with a recipe:

    import os

    import speechbrain as sb
    from speechbrain.utils.distributed import run_on_main

    from librispeech_prepare import prepare_librispeech  # noqa

    # Excerpt from the LibriSpeech recipe's train script; `hparams` has already
    # been loaded from the recipe's YAML at this point.
    print("BEFORE")
    print("RANK", os.environ["RANK"])

    # multi-gpu (ddp) save data preparation
    run_on_main(
        prepare_librispeech,
        kwargs={
            "data_folder": hparams["data_folder"],
            "tr_splits": hparams["train_splits"],
            "dev_splits": hparams["dev_splits"],
            "te_splits": hparams["test_splits"],
            "save_folder": hparams["output_folder"],
            "merge_lst": hparams["train_splits"],
            "merge_name": "train.csv",
            "skip_prep": hparams["skip_prep"],
        },
    )
    print("AFTER")
    print("RANK", os.environ["RANK"])

    sb.utils.distributed.ddp_barrier() 

    if os.environ['RANK'] == '0':
        print("*" * 10)
    print("BEFORE")
    print("RANK", os.environ["RANK"])

    # multi-gpu (ddp) save data preparation
    run_on_main(
        prepare_librispeech,
        kwargs={
            "data_folder": hparams["data_folder"],
            "tr_splits": hparams["train_splits"],
            "dev_splits": hparams["dev_splits"],
            "te_splits": hparams["test_splits"],
            "save_folder": hparams["output_folder"],
            "merge_lst": hparams["train_splits"],
            "merge_name": "train.csv",
            "skip_prep": hparams["skip_prep"],
        },
    )
    print("AFTER")
    print("RANK", os.environ["RANK"])

And got:

BEFORE
RANK 0
BEFORE
RANK 1
librispeech_prepare - Skipping preparation, completed in previous run.
AFTER
RANK 0
AFTER
RANK 1
**********
BEFORE
RANK 0
BEFORE
RANK 1
librispeech_prepare - Skipping preparation, completed in previous run.
AFTER
RANK 0
AFTER
RANK 1

I also tried putting multiple barriers in place and got the expected results.

@TParcollet (Collaborator, Author) commented:

A total mystery. I'll investigate once I'm done with my current duties at SAIC.

@asumagic (Collaborator) commented:

Can't reproduce on Jean Zay with 4x A100 under DDP either, with PyTorch 2.3.0.

@asumagic (Collaborator) commented:

Can't reproduce on Adastra with 8x MI250X under DDP either, also with PyTorch 2.3.0.

Also:

> From the PyTorch 2.3 documentation, we can read: "ProcessGroupNCCL now relies on stream synchronization instead of device synchronization to block the CPU. Thus, please do not assume that barrier() would perform a device synchronization." This is scary.

It did not occur to me at the time, but I don't think this is actually a problem: I suspect "device synchronization" does not refer to synchronization "between devices", but to synchronizing every CUDA stream on a device, whereas the new behaviour only synchronizes the barrier's own CUDA stream.

This would only be relevant if we made explicit use of CUDA streams, which we don't.
The documentation as phrased still says barrier() is to be used as a barrier between processes, which is all we use it for.
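
As a rough illustration of that reading, here is a minimal sketch (assumptions: a torchrun launch, the NCCL backend, and plain torch.distributed calls rather than the SpeechBrain wrappers):

    # Minimal sketch, assuming launch via torchrun with one GPU per process.
    import os

    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # barrier() still blocks each process until every rank has reached this
    # point; per the 2.3 note it now only guarantees synchronization of the
    # barrier's CUDA stream, not of the whole device.
    dist.barrier()

    # Only if explicit CUDA streams were in flight would a full device
    # synchronization matter, and it would have to be requested explicitly:
    torch.cuda.synchronize()

    dist.destroy_process_group()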

@Adel-Moumen (Collaborator) commented:

I suggest closing this issue.

@TParcollet (Collaborator, Author) commented May 31, 2024:

Right, it's most likely a hardware issue on my side. It still happens, dunno why.

@asumagic (Collaborator) commented:

> Right, it's most likely a hardware issue on my side. It still happens, dunno why.

Maybe it's something similar to what @Adel-Moumen hit at some point, where all processes thought they were the main one, IIRC?
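
(For reference, a quick sanity check along those lines would be the minimal sketch below, using only standard torch.distributed calls with the gloo backend; each process should print a distinct rank and only rank 0 should report being the main one.)

    # Minimal sanity-check sketch: run with torchrun. Each process should print
    # a distinct rank, and only rank 0 should report main=True.
    import os

    import torch.distributed as dist

    dist.init_process_group(backend="gloo")  # gloo: no GPU needed for this check
    rank = dist.get_rank()
    print(f"env RANK={os.environ['RANK']} dist rank={rank} main={rank == 0}")
    dist.destroy_process_group()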
