
Torch 2.3 breaks DDP? #2527

Closed
TParcollet opened this issue Apr 26, 2024 · 7 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed), important

@TParcollet (Collaborator) commented Apr 26, 2024:

Describe the bug

Starting any recipe with two or more processes under DDP fails, as the run_on_main function does not hold back (no barrier?) the other processes.

From the PyTorch 2.3 documentation, we can read: "ProcessGroupNCCL now relies on stream synchronization instead of device synchronization to block the CPU. Thus, please do not assume that barrier() would perform a device synchronization." This is scary.
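
For context, the pattern the recipes rely on looks roughly like the minimal sketch below (`prepare_data` is a hypothetical stand-in for a recipe's data-preparation function, and the DDP process group is assumed to be initialized by the launcher):

    # Minimal sketch of the run_on_main pattern used by the recipes.
    # `prepare_data` is a hypothetical placeholder; the DDP process group is
    # assumed to already be initialized (e.g. when launched via torchrun).
    import speechbrain as sb
    from speechbrain.utils.distributed import run_on_main

    def prepare_data():
        print("running data preparation on the main process only")

    # Expected: rank 0 runs prepare_data while every other rank waits at a
    # barrier until it finishes; the report above is that with torch 2.3 the
    # other ranks are no longer held back.
    run_on_main(prepare_data)
    sb.utils.distributed.ddp_barrier()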

Expected behaviour

Well, the barrier should stop the other processes.

To Reproduce

No response

Environment Details

No response

Relevant Log Output

No response

Additional Context

No response

TParcollet added the bug label on Apr 26, 2024
TParcollet added the help wanted and important labels on Apr 26, 2024
@Adel-Moumen (Collaborator) commented:

I'm unable to reproduce your issue.

I tried on a two-node setup with the following example:

    import os
    import time

    import speechbrain as sb
    import speechbrain.utils.distributed  # noqa: F401  (makes sb.utils.distributed available)

    # The DDP process group is assumed to already be initialized (e.g. by the
    # recipe / launcher) before this snippet runs.
    if os.environ['RANK'] == '0':
        # sleep for a few seconds so the other ranks reach the barrier first
        print("Rank 0 is sleeping")
        time.sleep(5)
        print("Rank 0 finished sleeping")
        print("Rank 0 is ready for barrier")
        sb.utils.distributed.ddp_barrier()
        print("Rank 0 is done")
    else:
        print(f"Rank {os.environ['RANK']} is ready for barrier")
        sb.utils.distributed.ddp_barrier()
        print(f"Rank {os.environ['RANK']} is done")

And got the following output:

Rank 1 is ready for barrier
Rank 0 is sleeping
Rank 0 finished sleeping
Rank 0 is ready for barrier
Rank 0 is done
Rank 1 is done

I also tried with a recipe:

    import os

    import speechbrain as sb
    from speechbrain.utils.distributed import run_on_main

    from librispeech_prepare import prepare_librispeech  # noqa

    # Excerpt from the LibriSpeech recipe's train script; `hparams` has already
    # been loaded from the recipe's YAML at this point.
    print("BEFORE")
    print("RANK", os.environ["RANK"])

    # multi-gpu (ddp) save data preparation
    run_on_main(
        prepare_librispeech,
        kwargs={
            "data_folder": hparams["data_folder"],
            "tr_splits": hparams["train_splits"],
            "dev_splits": hparams["dev_splits"],
            "te_splits": hparams["test_splits"],
            "save_folder": hparams["output_folder"],
            "merge_lst": hparams["train_splits"],
            "merge_name": "train.csv",
            "skip_prep": hparams["skip_prep"],
        },
    )
    print("AFTER")
    print("RANK", os.environ["RANK"])

    sb.utils.distributed.ddp_barrier() 

    if os.environ['RANK'] == '0':
        print("*" * 10)
    print("BEFORE")
    print("RANK", os.environ["RANK"])

    # multi-gpu (ddp) save data preparation
    run_on_main(
        prepare_librispeech,
        kwargs={
            "data_folder": hparams["data_folder"],
            "tr_splits": hparams["train_splits"],
            "dev_splits": hparams["dev_splits"],
            "te_splits": hparams["test_splits"],
            "save_folder": hparams["output_folder"],
            "merge_lst": hparams["train_splits"],
            "merge_name": "train.csv",
            "skip_prep": hparams["skip_prep"],
        },
    )
    print("AFTER")
    print("RANK", os.environ["RANK"])

And got:

BEFORE
RANK 0
BEFORE
RANK 1
librispeech_prepare - Skipping preparation, completed in previous run.
AFTER
RANK 0
AFTER
RANK 1
**********
BEFORE
RANK 0
BEFORE
RANK 1
librispeech_prepare - Skipping preparation, completed in previous run.
AFTER
RANK 0
AFTER
RANK 1

I also tried putting multiple barriers in place and got the expected results.

@TParcollet (Collaborator, Author) commented:

A total mystery. I'll investigate once I'm done with my current duties at SAIC.

@asumagic (Collaborator) commented:

Can't reproduce on Jean Zay with 4x A100 under DDP either, with PyTorch 2.3.0.

@asumagic (Collaborator) commented:

Can't reproduce on Adastra with 8x MI250X under DDP either, also with PyTorch 2.3.0.

Also:

> From the PyTorch 2.3 documentation, we can read: "ProcessGroupNCCL now relies on stream synchronization instead of device synchronization to block the CPU. Thus, please do not assume that barrier() would perform a device synchronization." This is scary.

It did not occur to me at the time, but I don't think this is actually a problem: I suspect "device synchronization" does not refer to synchronization "between devices", but to synchronizing every CUDA stream on a device, whereas the new behaviour only synchronizes the barrier's own CUDA stream.

This would only be relevant if we made explicit use of CUDA streams, which we don't.
The documentation as phrased still says barrier() is to be used as a barrier between processes, which is all we use it for.
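
As a rough illustration of that reading, here is a minimal sketch (assumptions: a torchrun launch, the NCCL backend, and plain torch.distributed calls rather than the SpeechBrain wrappers):

    # Minimal sketch, assuming launch via torchrun with one GPU per process.
    import os

    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # barrier() still blocks each process until every rank has reached this
    # point; per the 2.3 note it now only guarantees synchronization of the
    # barrier's CUDA stream, not of the whole device.
    dist.barrier()

    # Only if explicit CUDA streams were in flight would a full device
    # synchronization matter, and it would have to be requested explicitly:
    torch.cuda.synchronize()

    dist.destroy_process_group()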

@Adel-Moumen (Collaborator) commented:

I suggest closing this issue.

@TParcollet (Collaborator, Author) commented May 31, 2024:

Right, it's most likely a hardware issue on my side. It still happens, dunno why.

@asumagic (Collaborator) commented:

> Right, it's most likely a hardware issue on my side. It still happens, dunno why.

Maybe it's something similar to what @Adel-Moumen hit at some point, where all processes thought they were the main one, IIRC?
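
(For reference, a quick sanity check along those lines would be the minimal sketch below, using only standard torch.distributed calls with the gloo backend; each process should print a distinct rank and only rank 0 should report being the main one.)

    # Minimal sanity-check sketch: run with torchrun. Each process should print
    # a distinct rank, and only rank 0 should report main=True.
    import os

    import torch.distributed as dist

    dist.init_process_group(backend="gloo")  # gloo: no GPU needed for this check
    rank = dist.get_rank()
    print(f"env RANK={os.environ['RANK']} dist rank={rank} main={rank == 0}")
    dist.destroy_process_group()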
