
Use ACME with Ray Tune ASHAScheduler #311

Open
lyshuga opened this issue Oct 31, 2023 · 0 comments

lyshuga commented Oct 31, 2023

Hi,

I have run into an issue when running Ray Tune experiments together with ACME distributed SAC training and the ASHA scheduler. The idea behind the ASHA scheduler is that it terminates trials early in order to find good hyperparameters. When the single-process experiment is used, no hanging processes are left behind after Ray terminates a trial. Here is how the job is started:

# Single-process run: the learner, actors and evaluator all live in this
# process, so Ray terminating the trial cleans everything up.
experiments.run_experiment(
    experiment=experiment,
    eval_every=1_000,
    num_eval_episodes=1)

By contrast, when the job is started in multi-processing mode, the termination performed by Ray does not affect the processes that Launchpad has spawned. As a result, the ACME training job keeps running instead of being terminated. Here is how the trainable function and the Ray Tune config are defined:

import multiprocessing as mp

import launchpad as lp
from acme.jax import experiments
from ray import tune
from ray.tune.schedulers import ASHAScheduler

# build_experiment_config(config) and config_space are defined elsewhere.


def create_and_run_program(config):
    """Builds the experiment and launches it as a Launchpad program."""
    experiment = build_experiment_config(config)
    program = experiments.make_distributed_experiment(
        experiment=experiment,
        num_actors=1,
    )
    lp.launch(program, lp.LaunchType.LOCAL_MULTI_PROCESSING)


def train_function(config):
    # Run the Launchpad launcher in a separate process.
    p = mp.Process(
        target=create_and_run_program,
        args=(config,))
    p.start()
    p.join()  # this blocks until the process terminates


trainable_with_cpu_gpu = tune.with_resources(train_function, {"cpu": 16, "gpu": 1})

asha_scheduler = ASHAScheduler(
    time_attr="training_iteration",
    max_t=100_000,
    grace_period=20_000,
    reduction_factor=4,
    brackets=1,
)

tuner = tune.Tuner(
    trainable_with_cpu_gpu,
    tune_config=tune.TuneConfig(
        scheduler=asha_scheduler,
        metric="rewards_episode",
        mode="max",
        num_samples=100,
        reuse_actors=False,
    ),
    param_space=config_space,
)

results = tuner.fit()

My question: is there a way to forward the termination signal from Ray (when it terminates a trial) to all of the Launchpad node processes?
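
For context, the kind of mechanism I have in mind is something like the untested sketch below: run the Launchpad launcher in its own process group and, when the trial process is asked to shut down, signal that whole group. This assumes Linux with the default fork start method, and that the trial process actually receives a SIGTERM (or at least exits normally) when Ray stops the trial; I have not verified either, so this is only an illustration of the question, not a working solution. build_experiment_config, experiments and lp are as in the snippet above.

import atexit
import multiprocessing as mp
import os
import signal


def create_and_run_program(config):
    # Move this process (and everything Launchpad spawns from it) into its
    # own process group so the whole tree can be signalled at once.
    os.setsid()
    experiment = build_experiment_config(config)
    program = experiments.make_distributed_experiment(
        experiment=experiment, num_actors=1)
    lp.launch(program, lp.LaunchType.LOCAL_MULTI_PROCESSING)


def train_function(config):
    p = mp.Process(target=create_and_run_program, args=(config,))
    p.start()

    def _terminate_program_group(*_):
        # After os.setsid() the child's pgid equals its pid, so this signals
        # the launcher and every Launchpad worker it has spawned.
        try:
            os.killpg(os.getpgid(p.pid), signal.SIGTERM)
        except ProcessLookupError:
            pass  # already gone

    try:
        # May fail if Ray runs this function outside the main thread.
        signal.signal(signal.SIGTERM, _terminate_program_group)
    except ValueError:
        pass
    atexit.register(_terminate_program_group)  # if the trial exits normally

    p.join()

If Launchpad or ACME already provides a supported way to shut down all locally spawned nodes from the parent process, a pointer to that would of course be preferable.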

Thank you in advance.
