Producer is closed forcefully loop #846

Open
RagingPuppies opened this issue Aug 2, 2021 · 0 comments

After restarting a broker, or after a broker failure (anything that triggers a leader election), it seems that some Brooklin TransportProviders can't self-heal and get stuck in a loop.

Brooklin is configured with "pausePartitionOnError": "true", a flag indicating whether to auto-pause a topic partition if dispatching its data for delivery to the destination system fails.

When the Brooklin producer (i.e. the TransportProvider) receives an error, it pauses the partition according to "pauseErrorPartitionDurationMs": "180000" (3 minutes).
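
For context, this is roughly how those two settings look in server properties (a sketch only; the connector name and the brooklin.server.connector.<name>. prefix are illustrative and depend on the deployment):

```properties
# Sketch only - connector name and prefix are illustrative, not copied from our config
brooklin.server.connector.kafkaMirroringConnector.pausePartitionOnError=true
# auto-paused partitions are retried after this interval (3 minutes)
brooklin.server.connector.kafkaMirroringConnector.pauseErrorPartitionDurationMs=180000
```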

Looking at the Brooklin logs, I found the following errors at the time of the issue:
"Flush interrupted."
"This server is not the leader for that topic-partition."
"Partition rewind failed due to"
This means that at this point our Brooklin producer is trying to produce to a broker that is no longer the leader for that partition.
Roughly 5 minutes later, I saw the following error message:
"Expiring 227 record(s) for <topic_name>-12: 302797 ms has passed since last append"
Comparing this with the Brooklin configuration, I spotted "request.timeout.ms": "300000", which is 5 minutes.
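
As far as I understand, request.timeout.ms is a plain Kafka producer setting that the transport provider passes through to its internal KafkaProducer. A sketch of where it sits (transport provider name and prefix are illustrative):

```properties
# Sketch only - transport provider name and prefix are illustrative
# 300000 ms lines up with the ~302797 ms in the "Expiring 227 record(s)" message above
brooklin.server.transportProvider.kafkaTransportProvider.request.timeout.ms=300000
```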

For the next 20 minutes we received NotLeaderForPartitionException, which means we did not produce any data, and it seems we did not consume either.
Later on there is only one exception left: "Producer is closed forcefully."
Reading around online, someone suggested the producer may not be keeping up with the consumer; "producersPerTask" and "numProducersPerConnector" in our configuration should address that (see the sketch after this paragraph).
Looking at the consumer group info, it seems it stopped consuming as well.
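
For reference, a sketch of where those two knobs might be set (prefix and values are illustrative, not a verified fix, and the exact placement may differ):

```properties
# Sketch only - not a verified fix; prefix and values are illustrative
# more producers per connector spreads produce load across KafkaProducer instances,
# so a single closed/stuck producer does not stall every task
brooklin.server.transportProvider.kafkaTransportProvider.numProducersPerConnector=10
brooklin.server.transportProvider.kafkaTransportProvider.producersPerTask=1
```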

At the same time, we have another Datastream that replicates to the SAME cluster and topics and shares the same configurations; the failing Datastream has 8 more in maxTasks.
The source of the failing Datastream is a remote Kafka cluster, while the working one reads from a local Kafka cluster, and the local one does not fail at all, not even a single exception.

[local]Cluster A ---> Brooklin ---> Cluster C
[remote]Cluster B -----^
On the Datastream sourced from the remote cluster, some (2~3) TransportProviders are failing.

Brooklin configuration:
https://pastebin.com/raw/kHACqwcA

Your environment

  • Ubuntu 18.04
  • Brooklin 1.1.0
  • Java 1.8.0_152
  • Kafka 2.5.0
  • ZK 3.4.5

Ideas?
