
Previously clustered instances incorrectly cached #5202

Closed
aksabg opened this issue May 13, 2024 · 1 comment

aksabg commented May 13, 2024

Questions

It seems that Vert.x caches the addresses of previously shut-down instances when using the clustered event bus. Some event bus send calls fail with a timeout exception because the consumer on the address no longer exists.
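
For context, a call that times out looks roughly like the following (a minimal sketch for illustration, not our actual code; the address and the 30 s timeout are taken from the logs below, the rest is hypothetical):

import io.vertx.core.Vertx;
import io.vertx.core.eventbus.DeliveryOptions;
import io.vertx.core.eventbus.ReplyException;
import io.vertx.core.eventbus.ReplyFailure;

public class StatusCheckClient {

  // "vertx" is the clustered instance in the real deployment.
  static void checkStatus(Vertx vertx) {
    vertx.eventBus()
      .request("status/check/NotifyService", "ping",          // address seen in the logs
               new DeliveryOptions().setSendTimeout(30_000))  // 30 s, matching the timeout in the logs
      .onSuccess(reply -> System.out.println("Reply: " + reply.body()))
      .onFailure(err -> {
        if (err instanceof ReplyException
            && ((ReplyException) err).failureType() == ReplyFailure.TIMEOUT) {
          // The failure we observe: the clustered subscription still points at a
          // node that has already left the cluster, so no reply ever arrives.
          System.err.println("Timed out: " + err.getMessage());
        } else {
          err.printStackTrace();
        }
      });
  }
}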

Version

Vert.x 4.5.5
hazelcast-kubernetes 3.2.3

Context

We are using clustered Vert.x with Hazelcast Kubernetes discovery. We have multiple Kubernetes pods, each containing one verticle.
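
The cluster bootstrap in each pod looks roughly like this (a minimal sketch, assuming the vertx-hazelcast cluster manager; the Kubernetes discovery settings are omitted and MyVerticle is a placeholder):

import com.hazelcast.config.Config;
import io.vertx.core.AbstractVerticle;
import io.vertx.core.Vertx;
import io.vertx.core.VertxOptions;
import io.vertx.spi.cluster.hazelcast.HazelcastClusterManager;

public class ClusteredMain {

  // Placeholder verticle; each pod deploys exactly one of these.
  public static class MyVerticle extends AbstractVerticle {
    @Override
    public void start() {
      vertx.eventBus().consumer("status/check/NotifyService", msg -> msg.reply("ok"));
    }
  }

  public static void main(String[] args) {
    // Hazelcast configuration; in the real setup the hazelcast-kubernetes
    // discovery plugin is enabled here (details omitted).
    Config hazelcastConfig = new Config();

    HazelcastClusterManager clusterManager = new HazelcastClusterManager(hazelcastConfig);
    VertxOptions options = new VertxOptions().setClusterManager(clusterManager);

    Vertx.clusteredVertx(options)
      .compose(vertx -> vertx.deployVerticle(new MyVerticle()))
      .onSuccess(id -> System.out.println("Deployed verticle " + id))
      .onFailure(Throwable::printStackTrace);
  }
}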

Periodically we update the underlying virtual machines: we start a new virtual machine, shut down one pod (of multiple replicas) on the old machine, start it on the new one, repeat the process until all pods have been migrated, and then delete the old machine.

It appears that somewhere in this process a split brain occurred. Hazelcast was apparently able to recover, but Vert.x was not: it keeps trying to send event bus messages to addresses that no longer exist in the cluster. The only way we have been able to solve this problem is to shut down all verticles and start them again.

Potentially relevant log statements


{
  "time": "2024-05-12T10:13:04.811512654Z",
  "level": "ERROR",
  "class": "com.myorg.MyClass",
  "message": "Received unknown error code. Error code received is -1, message received is Timed out after waiting 30000(ms) for a reply. address: __vertx.reply.d9179ccb-e25f-4e4e-9189-ff04368e4abb, repliedAddress: status/check/NotifyService."
}

{
  "time": "2024-05-12T10:06:34.812700633Z",
  "level": "WARN",
  "class": "io.vertx.core.eventbus.impl.clustered.ConnectionHolder",
  "requestId": "req-PvHlXydYJAhS7Run6wd4",
  "message": "Connecting to server d9f36bb4-029f-44d6-8f7c-304be8285b22 failed",
  "stacktrace": "io.vertx.core.impl.NoStackTraceThrowable: Not a member of the cluster\n"
}

{
  "time": "2024-05-12T10:14:34.810948660Z",
  "level": "WARN",
  "class": "io.vertx.core.eventbus.impl.clustered.ConnectionHolder",
  "requestId": "req-PvHlXydYJAhS7Run6wd4",
  "message": "Connecting to server 43e35345-2b5a-4b89-bb59-f5bf67f01780 failed",
  "stacktrace": "io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /172.21.128.0:36519\nCaused by: java.net.ConnectException: Connection refused\n\tat java.base/sun.nio.ch.Net.pollConnect(Native Method)\n\tat java.base/sun.nio.ch.Net.pollConnectNow(Unknown Source)\n\tat java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)\n\tat io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)\n\tat io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:335)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)\n\tat io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)\n\tat io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\tat java.base/java.lang.Thread.run(Unknown Source)\n"
}

We cannot reproduce the issue consistently; it only happens occasionally.

Any ideas on what might be going on?

@aksabg aksabg added the bug label May 13, 2024
@tsegismont tsegismont added question and removed bug labels May 27, 2024
@tsegismont (Contributor)

Hi @aksabg

This is the GitHub repository of the Vert.x core library; please send future reports to vertx-hazelcast.

In the event of a split-brain, it is possible that subscriptions become inconsistent.

Please check out these recommendations: https://vertx.io/docs/vertx-hazelcast/java/#_recommendations

In summary, make sure you shut down nodes gracefully and one after the other, and add new nodes gradually as well.
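
A minimal sketch of what a graceful shutdown can look like, assuming vertx is the clustered instance created at startup (close Vert.x before the JVM exits so the node leaves the cluster and its event bus subscriptions are removed):

// Close the clustered Vert.x instance on pod shutdown (SIGTERM) so the node
// leaves the cluster cleanly and its event bus subscriptions are removed.
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
  vertx.close()                    // undeploys verticles and leaves the cluster
    .toCompletionStage()
    .toCompletableFuture()
    .join();                       // block the hook until shutdown completes
}));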

@tsegismont tsegismont closed this as not planned May 27, 2024