Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Improve][Connector-V2] Improve Paimon source split enumerator #6766

Open
wants to merge 5 commits into
base: dev
Choose a base branch
from

Conversation

xiaochen-zhou
Copy link
Contributor

Purpose of this pull request
Improve Paimon source split enumerator

Does this PR introduce any user-facing change?
no

How was this patch tested?
exists test

Check list

@Hisoka-X
Copy link
Member

cc @hailin0

@dailai
Copy link
Contributor

dailai commented Apr 29, 2024

Please add test case.

@xiaochen-zhou
Copy link
Contributor Author

Please add test case.

Thanks for your review. I think this is an optimization of the existing PaimonSourceSplitEnumerator. Currently, pamion's test case can already cover it.

}

public PaimonSourceSplitEnumerator(
Context<PaimonSourceSplit> context, Table table, PaimonSourceState sourceState) {
this.context = context;
this.table = table;
this.assignedSplit = sourceState.getAssignedSplits();
this.pendingSplit = new HashMap<>();
this.shouldEnumerate = sourceState == null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please optimize the code like this
this.shouldEnumerate = (sourceState == null || sourceState.isShouldEnumerate());

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can refer the org.apache.seatunnel.connectors.seatunnel.file.source.split.FileSourceSplitEnumerator in which we can keep the pendingSplit and assignedSplit at the sametime. It's more logical that way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can refer the org.apache.seatunnel.connectors.seatunnel.file.source.split.FileSourceSplitEnumerator in which we can keep the pendingSplit and assignedSplit at the sametime. It's more logical that way.

OK, I will optimize the code based on your suggestions

}

@Override
public void open() {
this.pendingSplit = new HashSet<>();
this.pendingSplit = new HashedMap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will overwrite the init in the construct method.

public Set<PaimonSourceSplit> getAssignedSplits() {
return assignedSplits;
}
private final Map<Integer, Set<PaimonSourceSplit>> assignedSplits;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change Set to Map here? Any considerations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change Set to Map here? Any considerations?

I think there is a small optimization,avoid the traversal operation when removing the allocated PaimonSourceSplit from pendingSplit. use set like FileSourceSplitEnumerator, a nested loop will be formed in the FileSourceSplitEnumerator#run() method.

@dailai
Copy link
Contributor

dailai commented May 9, 2024

Could you implement stream read for paimon source ? I don't think this is necessary in batch read scenarios.

@xiaochen-zhou
Copy link
Contributor Author

Could you implement stream read for paimon source ? I don't think this is necessary in batch read scenarios.

Ok, I'd be happy to do it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants