Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nameserver_probe_callback() accesses a nameserver object freed by evdns_base_clear_nameservers_and_suspend() #1506

Open
stablebits opened this issue Aug 15, 2023 · 6 comments

Comments

@stablebits
Copy link

We are using an older version of libsdwan but, as far as I can see, the current latest upstream version seems to have the same issue.

What we see (sentry reports, didn't reproduce locally yet):

  1. reply_run_callback() -> nameserver_probe_callback() -> crash
  2. according to logs (breadcrumbs), evdns_base_clear_nameservers_and_suspend() was called right before the crash.

What I see in the code:

Before destroying nameserver objects, evdns_base_clear_nameservers_and_suspend() does

		if (server->probe_request) {
			evdns_cancel_request(server->base, server->probe_request);
			server->probe_request = NULL;
		}

https://github.com/libevent/libevent/blob/c9ec6aafb6a025f03636ae2ae7c18a2d4b8a7ed8/evdns.c#L3134C1-L3137C4

but evdns_cancel_request() exits when pending_cb is set

libevent/evdns.c

Line 3655 in c9ec6aa

if (handle->pending_cb) {

which means that reply_schedule_callback() was already called for this request and the callback is waiting to be run by the event loop.

Once it runs, reply_run_callback() -> nameserver_probe_callback() runs and accesses the freed nameserver object (which is either still a freed or reused block of memory at this time).

A quick/partial fix could probably be something like this:

diff --git a/evdns.c b/evdns.c
index 19efbde9..bd1cf099 100644
--- a/evdns.c
+++ b/evdns.c
@@ -3132,7 +3132,12 @@ evdns_base_clear_nameservers_and_suspend(struct evdns_base *base)
                if (evtimer_initialized(&server->timeout_event))
                        (void) evtimer_del(&server->timeout_event);
                if (server->probe_request) {
-                       evdns_cancel_request(server->base, server->probe_request);
+            if (server->probe_request->pending_cb) {
+                server->probe_request->have_reply = 0;
+                server->probe_request->err = DNS_ERR_CANCEL;
+            } else {
+                evdns_cancel_request(server->base, server->probe_request);
+            }
                        server->probe_request = NULL;
                }
                if (server->socket >= 0)

reply.data.raw and reply.cname are still freed by reply_run_callback() but the object is potentially fragile in this case (e.g. perhaps do a cleanup when resetting have_reply).

Does this scenario look plausible?

@azat
Copy link
Member

azat commented Aug 15, 2023

Thank you for such a detailed report!
Your analysis looks correct.

I don't remember the code precisely, but likely this simple fix will not work in multi-threaded environment, when events of the event_base could be processed from multiple threads, since in this case this deferred callback can be runned in parallel with evdns_base_clear_nameservers_and_suspend, and this means that you cannot modify any members of it.

It is "ok" right now because pending_cb checked/modified under evdns_base lock, but actually after a brief look there are more problems here, since you may get use-after-free because reply_run_callback could be already executed and free then handler, while you will look at some of it's fields...

This can be fixed by adding some refcnt (like the code for bufferevent does), but this will require locks right now (since atomic is "on pause" - #1378), though unlikely this will be a hot path

We are using an older version of libsdwan but, as far as I can see, the current latest upstream version seems to have the same issue.

Out of curiosity, what is libsdwan?

@stablebits
Copy link
Author

stablebits commented Aug 16, 2023

Out of curiosity, what is libsdwan?

oh, I meant to say we are using an older version of libevent :-) libsdwan is our internal library.

@azat
Copy link
Member

azat commented Nov 26, 2023

@stablebits you were writing about the fix, any progress? or we can close this?

@stablebits
Copy link
Author

stablebits commented Nov 27, 2023

@azat my last comment was incorrect and because there was no reaction from you yet, I simply deleted it.

Below is my initial take (I got distracted by other activitie, so this hasn't been tested yet):

evdns-fix.patch

Comments:

evdns_base_clear_nameservers_and_suspend() -> evdns_cancel_request() is accessing probe_request (struct evdns_request) while holding the base lock. They see probe_request through server->probe_request, which suggests that nameserver_probe_callback() hasn't ran yet. Otherwise, server->probe_request would have been NULL. nameserver_probe_callback() resets it while holding ns->base, which is the same base lock. As long as reply_run_callback() always runs some handle->user_callback (in our case nameserver_probe_callback()) before destroying evdns_request, at least the object itself should still exist if server->probe_request != NULL.

However, there are 2 issues. As you mentioned, while evdns_base_clear_nameserver_and_suspend() -> evdns_cancel_request() is accessing probe_request, reply_run_callback() may already be running (not just scheduled to run) and so:

  1. It accesses handle->* fields without holding the base lock, which means that evdns_base_clear_* can't change them safely.
  2. nameserver_probe_callback() may be waiting for the base lock here (*):
     if (result == DNS_ERR_CANCEL) {
         /* We canceled this request because the nameserver came up
          * for some other reason.  Do not change our opinion about
          * the nameserver. */
         return;
     }

     EVDNS_LOCK(ns->base);     <===== Waiting here (*)
     ns->probe_request = NULL;
     if (result == DNS_ERR_NONE || result == DNS_ERR_NOTEXIST) {
         /* this is a good reply */
         nameserver_up(ns);
     } else {
[...]

ready to access the destroyed ns object once the base lock becomes available.

In order to address these 2 issues, this patch wraps the code accessing handle and calinghandle->user_callback in reply_run_callback with the base lock.

It may well be that I missed some cases though. We'll run this code on our systems for a 1-2 weeks to see whether it solves the problem (there are a few crashes per day on average).

@azat
Copy link
Member

azat commented Feb 18, 2024

Hi @stablebits, sorry for the delay I couldn't allocate time for libevent for quit a long time...

evdns_base_clear_nameservers_and_suspend() -> evdns_cancel_request() is accessing probe_request (struct evdns_request) while holding the base lock. They see probe_request through server->probe_request, which suggests that nameserver_probe_callback() hasn't ran yet. Otherwise, server->probe_request would have been NULL. nameserver_probe_callback() resets it while holding ns->base, which is the same base lock. As long as reply_run_callback() always runs some handle->user_callback (in our case nameserver_probe_callback()) before destroying evdns_request, at least the object itself should still exist if server->probe_request != NULL.

That is correct.

In order to address these 2 issues, this patch wraps the code accessing handle and calinghandle->user_callback in reply_run_callback with the base lock.

Yes, your analysis looks correct.

Can you please submit your patch as a PR? With a few comments:

  • reply_run_callback don't need user_pointer anymore
  • nameserver_probe_callback do not need lock anymore (since reply_run_callback already holds it)

Apart from it, great analysis, thank you!

@azat azat added this to the release-2.2 milestone Feb 18, 2024
@stablebits
Copy link
Author

@azat here is a pull request #1561, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

2 participants