
fix(core): fix a deadlock issue on coroutine mutex #13019

Conversation

outsinre (Contributor):

Summary

fix(core): fix a deadlock issue on coroutine mutex

The coroutine mutex uses a semaphore to ensure only one job
is running for a specific event (e.g. reconfigure_handler).

However, if the coroutine job is wrapped within a light thread
(e.g. the worker event's event_thread) and the associated thread
is killed in the middle of the job, the semaphore post that
follows would be skipped.

A light thread is killed when it cannot connect to worker 0,
which is common when worker 0 is OOM-killed by the OS.

All subsequent jobs of the same event would then wait and time out,
as no semaphore resource is ever made available again.

Please refer to the analysis here https://konghq.atlassian.net/browse/FTI-5930?focusedCommentId=131903.

Checklist

  • The Pull Request has tests
  • A changelog file has been created under changelog/unreleased/kong or skip-changelog label added on PR if changelog is unnecessary.
  • There is a user-facing docs PR against https://github.com/Kong/docs.konghq.com - PUT DOCS PR HERE

Issue reference

Fix #FTI-5930

@github-actions github-actions bot added the cherry-pick kong-ee schedule this PR for cherry-picking to kong/kong-ee label May 13, 2024
@outsinre (Contributor Author):

Restore deleted comments:

[Screenshot: 2024-05-13 at 15:14:13]

Comment on lines 112 to 115
```lua
-- to resolve a deadlock issue in case the worker event thread
-- associated with `pcall(fn)` below got killed
-- and the `:post(1)` that follows is skipped.
if semaphore:count() <= 0 then
```

Restore missed comments:

[Screenshot: 2024-05-13 at 15:12:39]

outsinre (Contributor Author):

  1. @samugi we should at least post 1 to unlock loop dependencies. Posting 1 is safe, just as is done here: https://github.com/Kong/kong/blob/master/kong/concurrency.lua#L101-L110.
  2. @dndx a failed pcall has multiple causes. It runs within a timer thread spawned by worker events. If the thread aborts abnormally in the middle of the pcall, there is no chance to call semaphore:post(1).

Member:

Why would the thread abort in the worker event callback? Is this an intended behavior?

@outsinre (Contributor Author), May 13, 2024:

@dndx for example, worker 0 runs out of memory and is OOM-killed. Once worker 0 is dead, the other workers cannot connect to it and error out.

Those workers will then close the reconfigure thread while it is still applying the new config (inside the pcall): https://github.com/Kong/lua-resty-events/blob/main/lualib/resty/events/worker.lua#L249-L253.

@bungle (Member), May 13, 2024:

Again, this has nothing to do with workers being killed. Coroutine mutexes don't have any meaning across process boundaries, @outsinre, so please stop spreading this misinformation. I already answered in the original PR (which is exactly the same as this one). And stop reopening this. You are creating a bug here (a much bigger one than the original it tries to fix).

Yes, this may happen because other processes prematurely kill their light threads (such as event callbacks) when the broker is OOM-killed. But that is the bug: prematurely killing event callbacks.

@outsinre outsinre requested a review from chronolaw May 13, 2024 07:29
@bungle (Member) commented May 13, 2024:

@outsinre why is this reopened? I already said that this defeats the whole purpose of the coroutine mutex.

@bungle bungle marked this pull request as draft May 13, 2024 07:55
@outsinre outsinre force-pushed the FTI-5930-file-log-plugin-failing-after-upgrade-to-3-5-0-1 branch from 08caa04 to 51f8031 Compare May 14, 2024 12:18
@outsinre outsinre force-pushed the FTI-5930-file-log-plugin-failing-after-upgrade-to-3-5-0-1 branch from d2583eb to 6be1650 Compare May 14, 2024 13:03
```diff
@@ -125,11 +135,13 @@ function concurrency.with_coroutine_mutex(opts, fn)
     end
   end

   running_jobs[opts_name] = true
```
outsinre (Contributor Author):

The chance of being killed here is quite low.

@outsinre outsinre closed this May 17, 2024
@outsinre (Contributor Author) commented May 17, 2024:

Will do this in the events lib instead.

Labels: cherry-pick kong-ee (schedule this PR for cherry-picking to kong/kong-ee), size/S