Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added a timeout on the producer flush call in KafkaMirrorMakerConnectorTask #957

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

jzakaryan
Copy link
Collaborator

Even in flushness mode BMM's tasks do a flush call on the producer when shutting down. We have observed that producer flush call tends to get indefinitely stuck and this keeps the tasks from shutting down gracefully. The code change in this PR addresses this by wrapping the producer flush call in a future and blocking on that future with a timeout.

If the producer flush doesn't complete in the given timeout window, the task will proceed to committing safe offsets and shutting down. The timeout window is exposed through a configuration property.

@jzakaryan jzakaryan changed the title Added a timeout on the producer flush call in KafkaMirrorMakerConnecorTask Added a timeout on the producer flush call in KafkaMirrorMakerConnectorTask Aug 28, 2023
Copy link
Collaborator

@shrinandthakkar shrinandthakkar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why shouldn't we just use the producer config offset.flush.timeout.ms instead of creating a newer config to wait on ?

ref: https://kafka.apache.org/21/documentation.html#producerconfigs

@jzakaryan
Copy link
Collaborator Author

why shouldn't we just use the producer config offset.flush.timeout.ms instead of creating a newer config to wait on ?

ref: https://kafka.apache.org/21/documentation.html#producerconfigs

@shrinandthakkar the problem is that the the tasks were found to be stuck on flush after 60 seconds despite the default value of 5 seconds for offset.flush.timeout.ms. Connector logs show that it interrupted the producer.flush to shut down the task.
We can reach out to Kafka team and ask if the behavior of LKC wrt producer flush and config values are the same as in Apache Kafka. But that can happen in parallel with us addressing it.

@shrinandthakkar
Copy link
Collaborator

why shouldn't we just use the producer config offset.flush.timeout.ms instead of creating a newer config to wait on ?
ref: https://kafka.apache.org/21/documentation.html#producerconfigs

@shrinandthakkar the problem is that the the tasks were found to be stuck on flush after 60 seconds despite the default value of 5 seconds for offset.flush.timeout.ms. Connector logs show that it interrupted the producer.flush to shut down the task. We can reach out to Kafka team and ask if the behavior of LKC wrt producer flush and config values are the same as in Apache Kafka. But that can happen in parallel with us addressing it.

@jzakaryan
Within the EventProducer's flush call, I think we already have a configuration defined (ref). And the value of that flush timeout config is INT_MAX ? Is that why we are waiting forever ?

Do you think if we should rather try to reconfigure that value for MM clusters ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants