XM Docker distributed experiments don't finish cleanly #312

Open
JoeMWatson opened this issue Nov 13, 2023 · 0 comments


Hi,

When running one of the baseline examples in a distributed fashion, e.g. with

python run_gail.py --num_steps=1 --run_distributed=True

the experiment finishes with the following output:

I1113 10:19:41.710619 139662026843712 lp_utils.py:98] StepsLimiter: Max steps of 1 was reached, terminating
I1113 10:19:41.711068 139671506786112 savers.py:205] Caught SIGTERM: forcing a checkpoint save.
Worker groups that did not terminate in time: ['actor']
Killing entire runtime.
Killed

Although the experiment itself runs successfully, this messy teardown means the experiment runner cannot exit cleanly. As a result, external loggers such as Weights and Biases report the experiment as 'crashed' even though it completed successfully.
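
One possible workaround for the logger side of the problem, sketched here under assumptions rather than as a confirmed fix: since the log above shows that SIGTERM is delivered and caught before the runtime is killed, the logger could be finalized from a SIGTERM handler. The helper name `install_sigterm_finalizer` and the `finalize` callback are hypothetical and not part of Launchpad or Acme. The sketch chains to whatever handler was installed before it, so an existing handler such as the checkpoint save should still run. It cannot help if the process receives SIGKILL first, because SIGKILL cannot be caught.

```python
import signal
import sys
from typing import Callable


def install_sigterm_finalizer(finalize: Callable[[], None]) -> Callable[[], None]:
    """Run `finalize` at most once when SIGTERM arrives, then chain onwards.

    Hypothetical helper: it must be called from the main thread, and it
    assumes SIGTERM reaches this process before any SIGKILL does.
    """
    previous = signal.getsignal(signal.SIGTERM)
    state = {"done": False}

    def run_once() -> None:
        # Guard against running twice, e.g. on SIGTERM and again at normal exit.
        if not state["done"]:
            state["done"] = True
            finalize()

    def handler(signum, frame):
        run_once()
        if callable(previous):
            # Preserve a handler that was installed earlier.
            previous(signum, frame)
        elif previous in (signal.SIG_DFL, None):
            # No Python-level handler existed: exit with status 0 so the run
            # is not recorded as a failure (an assumption about what is wanted).
            sys.exit(0)
        # If the previous disposition was SIG_IGN, keep ignoring the signal.

    signal.signal(signal.SIGTERM, handler)
    return run_once
```

With Weights and Biases, `finalize` could be `wandb.finish`, and the returned `run_once` can also be called at the normal end of the script. Whether the handler gets enough time to complete before the runtime is killed has not been verified here.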

In my own (imitation learning) code, I see the same problem for full experiments, but the error message is

Worker groups that did not terminate in time: ['learner']

Is there a way to get a cleaner teardown? The XM Docker launch function in Launchpad doesn't appear to return an object that could be used to wait for the workers to finish.
