Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][broker] The topic might reference a closed ledger #22739

Closed
wants to merge 2 commits into from

Conversation

shibd
Copy link
Member

@shibd shibd commented May 17, 2024

Motivation

We observe that a normal topic might reference a closed ledger and it never auto recover. will cause the producer and customer stuck.

The root cause is related to topic create timeout. Assuming we have two concurrent requests to create a topic: firstCreate(Thread-1), secondCreate(Thread-2)

Thread-1 Thread-2
firstCreate topic timeout
remove this old topic and recreate a new topic
Open ledger from cache, and get an old ledger that create by firstCreate topic
call topic.close
old ledger close and remove it from cache
but this old ledger being referenced to new topic and that stats is close
  • When the firstCreate topic timeout, will call topic.close. it will close the ledger, and remove it from the ledger cache.

    executor().submit(() -> {
    persistentTopic.close().whenComplete((ignore, ex) -> {
    if (ex != null) {
    log.warn("[{}] Get an error when closing topic.",
    topic, ex);
    }
    });
    });

  • If the secondCreate request successfully creates a topic before the old ledger closes, the reference will be made to the old ledger.

Modifications

  • When creating a topic, if a topic future exists and it completes with a timeout exception, do not remove it; instead, return the exception directly.
  • Once the topic closure is complete, remove the topic from the topics.

Verifying this change

  • Add testCloseLedgerThatTopicAfterCreateTimeout to cover this case.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: shibd#37

@shibd shibd added type/bug The PR fixed a bug or issue reported a bug release/3.3.1 release/3.0.6 release/3.2.4 labels May 17, 2024
@shibd shibd self-assigned this May 17, 2024
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label May 17, 2024
@shibd shibd added ready-to-test and removed doc-not-needed Your PR changes do not impact docs labels May 17, 2024
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label May 17, 2024
@shibd shibd closed this May 17, 2024
@shibd shibd reopened this May 17, 2024
@shibd shibd requested a review from dao-jun May 17, 2024 13:29
@dao-jun
Copy link
Member

dao-jun commented May 18, 2024

Great work!

@@ -1434,13 +1432,6 @@ public void testCleanupTopic() throws Exception {
// Ok
}

final CompletableFuture<Optional<Topic>> timedOutTopicFuture = topicFuture;
// timeout topic future should be removed from cache
retryStrategically((test) -> pulsar1.getBrokerService().getTopic(topicName, false) != timedOutTopicFuture, 5,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This unit test assumes a manager ledger that can never be completed to mock topic create timeout, which does not make sense in a production scenario.

Remove this assertion, and add an assertion afterward: Once ml is completed, the topic should be removed due to timeout.

@codelipenghui codelipenghui added this to the 3.4.0 milestone May 20, 2024
Comment on lines +1010 to +1011
&& FutureUtil.getException(topicFuture).get().equals(FAILED_TO_LOAD_TOPIC_TIMEOUT_EXCEPTION)) {
return FutureUtil.failedFuture(FAILED_TO_LOAD_TOPIC_TIMEOUT_EXCEPTION);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to suggest to follow the solution from this PR #22283

  1. The broker should clean up all the topicFutures for any exceptions finally
  2. The new get topic operation should not remove the topicFuture that created by previous get topic operation
  3. The topicFuture should only be created if there is no previous topicFuture. If the previous topicFuture is not removed from the map yet, the broker should always use the existing topicFuture (waiting for completion or return error to the client side directly)

The existing code has mixed them which will introduce contention for the get topic operation. Now, we checked the topic load timeout exception, but don't know how many others we also need to check.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with these points. Actually, I tried do that: https://github.com/shibd/pulsar/pull/38/files

But, I got some tests that not pass, So, I used this short-term solution.

I'm going to continue to look into it.

@shibd
Copy link
Member Author

shibd commented May 20, 2024

Close with this comments: #22739 (comment)

@shibd shibd closed this May 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc-not-needed Your PR changes do not impact docs ready-to-test release/3.0.6 release/3.2.4 release/3.3.1 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants