
Previously clustered instances incorrectly cached #5202

Closed
aksabg opened this issue May 13, 2024 · 1 comment

aksabg commented May 13, 2024

Questions

It seems that Vert.x caches the addresses of previously shut-down instances when using the clustered event bus. Some event bus send calls fail with a timeout exception because the consumer on the address no longer exists.
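
For context, a call that times out looks roughly like the following (a minimal sketch for illustration, not our actual code; the address and the 30 s timeout are taken from the logs below, the rest is hypothetical):

import io.vertx.core.Vertx;
import io.vertx.core.eventbus.DeliveryOptions;
import io.vertx.core.eventbus.ReplyException;
import io.vertx.core.eventbus.ReplyFailure;

public class StatusCheckClient {

  // "vertx" is the clustered instance in the real deployment.
  static void checkStatus(Vertx vertx) {
    vertx.eventBus()
      .request("status/check/NotifyService", "ping",          // address seen in the logs
               new DeliveryOptions().setSendTimeout(30_000))  // 30 s, matching the timeout in the logs
      .onSuccess(reply -> System.out.println("Reply: " + reply.body()))
      .onFailure(err -> {
        if (err instanceof ReplyException
            && ((ReplyException) err).failureType() == ReplyFailure.TIMEOUT) {
          // The failure we observe: the clustered subscription still points at a
          // node that has already left the cluster, so no reply ever arrives.
          System.err.println("Timed out: " + err.getMessage());
        } else {
          err.printStackTrace();
        }
      });
  }
}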

Version

Vert.x 4.5.5
hazelcast-kubernetes 3.2.3

Context

We are using clustered Vert.x with Hazelcast Kubernetes discovery. We have multiple Kubernetes pods, each containing one verticle.
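
The cluster bootstrap in each pod looks roughly like this (a minimal sketch, assuming the vertx-hazelcast cluster manager; the Kubernetes discovery settings are omitted and MyVerticle is a placeholder):

import com.hazelcast.config.Config;
import io.vertx.core.AbstractVerticle;
import io.vertx.core.Vertx;
import io.vertx.core.VertxOptions;
import io.vertx.spi.cluster.hazelcast.HazelcastClusterManager;

public class ClusteredMain {

  // Placeholder verticle; each pod deploys exactly one of these.
  public static class MyVerticle extends AbstractVerticle {
    @Override
    public void start() {
      vertx.eventBus().consumer("status/check/NotifyService", msg -> msg.reply("ok"));
    }
  }

  public static void main(String[] args) {
    // Hazelcast configuration; in the real setup the hazelcast-kubernetes
    // discovery plugin is enabled here (details omitted).
    Config hazelcastConfig = new Config();

    HazelcastClusterManager clusterManager = new HazelcastClusterManager(hazelcastConfig);
    VertxOptions options = new VertxOptions().setClusterManager(clusterManager);

    Vertx.clusteredVertx(options)
      .compose(vertx -> vertx.deployVerticle(new MyVerticle()))
      .onSuccess(id -> System.out.println("Deployed verticle " + id))
      .onFailure(Throwable::printStackTrace);
  }
}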

Periodically we update the underlying virtual machines: we start a new virtual machine, shut down one pod (of multiple replicas) on the old machine, start it on the new one, repeat the process until all pods have been migrated, and then delete the old machine.

It appears that somewhere in this process a split brain occurred. Hazelcast was apparently able to recover, but Vert.x was not: it keeps trying to send event bus messages to addresses that no longer exist in the cluster. The only way we have been able to solve this problem is to shut down all verticles and start them again.

Potentially relevant log statements


{
  "time": "2024-05-12T10:13:04.811512654Z",
  "level": "ERROR",
  "class": "com.myorg.MyClass",
  "message": "Received unknown error code. Error code received is -1, message received is Timed out after waiting 30000(ms) for a reply. address: __vertx.reply.d9179ccb-e25f-4e4e-9189-ff04368e4abb, repliedAddress: status/check/NotifyService."
}

{
  "time": "2024-05-12T10:06:34.812700633Z",
  "level": "WARN",
  "class": "io.vertx.core.eventbus.impl.clustered.ConnectionHolder",
  "requestId": "req-PvHlXydYJAhS7Run6wd4",
  "message": "Connecting to server d9f36bb4-029f-44d6-8f7c-304be8285b22 failed",
  "stacktrace": "io.vertx.core.impl.NoStackTraceThrowable: Not a member of the cluster\n"
}

{
  "time": "2024-05-12T10:14:34.810948660Z",
  "level": "WARN",
  "class": "io.vertx.core.eventbus.impl.clustered.ConnectionHolder",
  "requestId": "req-PvHlXydYJAhS7Run6wd4",
  "message": "Connecting to server 43e35345-2b5a-4b89-bb59-f5bf67f01780 failed",
  "stacktrace": "io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /172.21.128.0:36519\nCaused by: java.net.ConnectException: Connection refused\n\tat java.base/sun.nio.ch.Net.pollConnect(Native Method)\n\tat java.base/sun.nio.ch.Net.pollConnectNow(Unknown Source)\n\tat java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)\n\tat io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)\n\tat io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:335)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)\n\tat io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)\n\tat io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\tat java.base/java.lang.Thread.run(Unknown Source)\n"
}

We cannot reproduce the issue consistently; it only happens occasionally.

Any ideas on what might be going on?

@aksabg aksabg added the bug label May 13, 2024
@tsegismont tsegismont added question and removed bug labels May 27, 2024
@tsegismont (Contributor)

Hi @aksabg

This is the GitHub repository of the Vert.x core library; please send future reports to vertx-hazelcast.

In the event of a split-brain, it is possible that subscriptions become inconsistent.

Please check out these recommendations: https://vertx.io/docs/vertx-hazelcast/java/#_recommendations

In summary, make sure you shut down nodes gracefully and one after the other, and add new nodes gradually as well.
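
A minimal sketch of what a graceful shutdown can look like, assuming vertx is the clustered instance created at startup (close Vert.x before the JVM exits so the node leaves the cluster and its event bus subscriptions are removed):

// Close the clustered Vert.x instance on pod shutdown (SIGTERM) so the node
// leaves the cluster cleanly and its event bus subscriptions are removed.
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
  vertx.close()                    // undeploys verticles and leaves the cluster
    .toCompletionStage()
    .toCompletableFuture()
    .join();                       // block the hook until shutdown completes
}));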

@tsegismont tsegismont closed this as not planned May 27, 2024