Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MDEV-34129 mariadb-install-db appears to hang on macOS #3253

Open
wants to merge 1 commit into
base: 10.5
Choose a base branch
from

Conversation

DaveGosselin-MariaDB
Copy link
Member

@DaveGosselin-MariaDB DaveGosselin-MariaDB commented May 13, 2024

An overcomplicated implementation of wait_for_signal_handler_to_end wouldn't join the signal_hand thread, causing the server to exit with an error.

Additionally, let's immediately close down the signal handler loop when we decide to break connections--it's the start of process termination anyway, and there's no need to wait once we've invoked break_connections.

@DaveGosselin-MariaDB DaveGosselin-MariaDB self-assigned this May 13, 2024
@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Copy link
Member

@vuvova vuvova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the fix. What do you mean by "we could fail to initialize the signal handler but mark it as active anyway"? How could it have happened?

static void wait_for_signal_thread_to_end()
{
uint i, n_waits= DBUG_EVALUATE("force_sighup_processing_timeout", 5, 100);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you removed "force_sighup_processing_timeout", but it's used in the test

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed that part of the test, too, because as far as I can see it wasn't necessary; it was only needed because we had to wait, but we had to wait because of the bug that was in the signal_hand function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should add that by using pthread_join, we'll wait for all SIGHUP processing to finish, so there's no need for a special timeout. Joining pthreads is considered a best practice for the pthread library (unless the thread is created in a detached state, which isn't the case here).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rpl.rpl_shutdown_sighup continues to succeed with these changes, because the SIGHUP processing concludes as expected, without the unnecessary extra waiting semantics. Those were brittle because sleeps (and equivalent) are not reliable for multithreaded coordination.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I asked specifically because you did not remove that part of the test. It's line 95
https://github.com/MariaDB/server/pull/3253/files#diff-5186f012c3459b3333465ba9ba04df6972c376d56f8d591967e4fc13022b9498R95

@DaveGosselin-MariaDB
Copy link
Member Author

I don't understand the fix. What do you mean by "we could fail to initialize the signal handler but mark it as active anyway"? How could it have happened?

The signal handler will fail to initialize if my_thread_global_init_done is false, which is possible and expected in some of our tests (I don't recall which at the moment, but I think in certain plugins).

static void wait_for_signal_thread_to_end()
{
uint i, n_waits= DBUG_EVALUATE("force_sighup_processing_timeout", 5, 100);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I asked specifically because you did not remove that part of the test. It's line 95
https://github.com/MariaDB/server/pull/3253/files#diff-5186f012c3459b3333465ba9ba04df6972c376d56f8d591967e4fc13022b9498R95

signal_thread_in_use= 1;

// initialize this newly created thread
if (my_thread_init()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how can this fail? what's the realistic scenario when it fails?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can fail only in one of two ways:

  • calloc fails (extremely unlikely)
  • !my_thread_global_init_done is true at start of my_thread_init

I originally wrote this on 11.4 and ported it back to 10.5 before posting. Now running on latest 11.4, there are new macOS test failures. I cannot be sure that there was a real world failure, and the slow startup and 'hanging' install are both solved without this change. From the code perspective, calling my_end (which calls my_thread_global_end and sets my_thread_global_init_done to 0) will cause my_thread_init to return with an error, leaving the thread started but not initialized. In the server, we call my_end only once but in tests we call it in multiple places. During initial development, I had an abort() in the body of this conditional to see if we had a case where the init would fail, and I'm certain I saw it, but now cannot replicate.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With all that said, IMHO we should ship this fragment with the rest of the PR because there technically is a problem here without it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • my_thread_global_init_done is true at start of my_thread_init

is false, you mean. And I don't see how it can happen. The server calls my_init() at the beginning of mysqld_main()

May be it's set back to false because of some early shutdown? This sounds more plausible, but I still couldn't find how it could happen

@DaveGosselin-MariaDB
Copy link
Member Author

I asked specifically because you did not remove that part of the test. It's line 95

Ahh... this was an oversight on my part, I removed that in the latest diff.

An overcomplicated implementation of wait_for_signal_handler_to_end
wouldn't join the signal_hand thread, causing the server to exit with
an error.

Additionally, let's immediately close down the signal handler loop when
we decide to break connections--it's the start of process termination
anyway, and there's no need to wait once we've invoked break_connections.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
3 participants