
fix(core): fix a deadlock issue on coroutine mutex #13019

Conversation

outsinre (Contributor):

Summary

fix(core): fix a deadlock issue on coroutine mutex

The coroutine mutex uses a semaphore to ensure only one job
is running for a specific event (e.g. reconfigure_handler).

However, if the coroutine job is wrapped within a light thread
(e.g. the worker event's event_thread) and the associated thread
is killed in the middle of the job, the semaphore post that
follows would be skipped.

A light thread is killed when it cannot connect to worker 0,
which is common when worker 0 is OOM-killed by the OS.

All subsequent jobs of the same event would then wait and time out,
as no semaphore resource is ever made available again.

Please refer to the analysis here https://konghq.atlassian.net/browse/FTI-5930?focusedCommentId=131903.

Checklist

  • The Pull Request has tests
  • A changelog file has been created under changelog/unreleased/kong or skip-changelog label added on PR if changelog is unnecessary.
  • There is a user-facing docs PR against https://github.com/Kong/docs.konghq.com - PUT DOCS PR HERE

Issue reference

Fix #FTI-5930

@github-actions github-actions bot added the cherry-pick kong-ee schedule this PR for cherry-picking to kong/kong-ee label May 13, 2024
@outsinre (Contributor Author):

Restore deleted comments:

[Screenshot: 2024-05-13 at 15:14:13]

Comment on lines 112 to 115
```lua
-- to resolve a deadlock issue in case the worker event thread
-- associated with `pcall(fn)` below got killed
-- and the `:post(1)` that follows is skipped.
if semaphore:count() <= 0 then
```

Restore missed comments:

[Screenshot: 2024-05-13 at 15:12:39]

outsinre (Contributor Author):

  1. @samugi we should at least post 1 to unlock loop dependencies. Posting 1 is safe, just as is done here: https://github.com/Kong/kong/blob/master/kong/concurrency.lua#L101-L110.
  2. @dndx a failed pcall has multiple causes. It runs within a timer thread spawned by worker events. If the thread aborts abnormally in the middle of the pcall, there is no chance to call semaphore:post(1).

Member:

Why would the thread abort in the worker event callback? Is this an intended behavior?

@outsinre (Contributor Author), May 13, 2024:

@dndx for example, worker 0 runs out of memory and is OOM-killed. Once worker 0 is dead, the other workers cannot connect to it and error out.

Those workers will then close the reconfigure thread while it is still applying the new config (inside the pcall): https://github.com/Kong/lua-resty-events/blob/main/lualib/resty/events/worker.lua#L249-L253.

@bungle (Member), May 13, 2024:

Again, this has nothing to do with workers being killed. Coroutine mutexes don't have any meaning across process boundaries, @outsinre, so please stop spreading this misinformation. I already answered in the original PR (which is exactly the same as this one). And stop reopening this. You are creating a bug here (a much bigger one than the original it tries to fix).

Yes, this may happen because other processes prematurely kill their light threads (such as event callbacks) when the broker is OOM-killed. But that is the bug: prematurely killing event callbacks.

@outsinre outsinre requested a review from chronolaw May 13, 2024 07:29
@bungle (Member) commented May 13, 2024:

@outsinre why is this reopened? I already said that this defeats the whole purpose of the coroutine mutex.

@bungle bungle marked this pull request as draft May 13, 2024 07:55
@outsinre outsinre force-pushed the FTI-5930-file-log-plugin-failing-after-upgrade-to-3-5-0-1 branch from 08caa04 to 51f8031 Compare May 14, 2024 12:18
@outsinre outsinre force-pushed the FTI-5930-file-log-plugin-failing-after-upgrade-to-3-5-0-1 branch from d2583eb to 6be1650 Compare May 14, 2024 13:03
```diff
@@ -125,11 +135,13 @@ function concurrency.with_coroutine_mutex(opts, fn)
     end
   end

   running_jobs[opts_name] = true
```
outsinre (Contributor Author):

The chance of being killed here is quite low.

@outsinre outsinre closed this May 17, 2024
@outsinre (Contributor Author) commented May 17, 2024:

Will do this in the events lib instead.

Labels: cherry-pick kong-ee (schedule this PR for cherry-picking to kong/kong-ee), size/S