
[Flink] Update using partial columns. #3299

Status: Open — wants to merge 3 commits into base: master

Conversation

@hzjhjjyy (Contributor) commented May 6, 2024

Purpose

FLINK-32001 has been resolved, so partial columns can now be used for updates.

Different merge engines have different requirements (support for updating aggregation tables has also been added):

  1. deduplicate: only full columns can be used.
  2. partial-update: in addition to the updated columns, the appended columns include primary keys, partition keys, the sequence field, columns that define last_value in aggregation, and the sequence-group columns of the updated columns.
  3. aggregation: compared to partial-update, it omits the sequence-group columns of the updated columns.
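The rules above can be sketched as a small helper. This is a hypothetical illustration, not Paimon's actual implementation; the function and schema-field names (`required_columns`, `sequence_groups`, etc.) are made up for the sketch:

```python
def required_columns(merge_engine, update_cols, schema):
    """Sketch: which columns an UPDATE must carry, per merge engine.

    merge_engine: "deduplicate" | "partial-update" | "aggregation"
    update_cols:  columns named in the UPDATE's SET clause
    schema:       dict with keys "all", "primary", "partition",
                  "sequence_field", "last_value_agg", "sequence_groups"
                  (sequence_groups maps a column to its sequence-group columns)
    """
    if merge_engine == "deduplicate":
        # deduplicate: only full-column updates are allowed
        return list(schema["all"])

    cols = set(update_cols)
    # both partial-update and aggregation append keys and sequence metadata
    cols |= set(schema["primary"])
    cols |= set(schema["partition"])
    cols |= set(schema["sequence_field"])
    cols |= set(schema["last_value_agg"])  # columns aggregated with last_value
    if merge_engine == "partial-update":
        # partial-update additionally needs the sequence-group columns
        # of each updated column
        for c in update_cols:
            cols |= set(schema["sequence_groups"].get(c, ()))
    return sorted(cols)
```

For example, updating a single column of a partial-update table would also pull in the primary key, partition key, sequence field, and that column's sequence-group, while an aggregation table would skip the sequence-group.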

Tests

BatchUpdateWithPartialColumnsITCase (covers all aggregations, the sequence field/group, and changelog)

API and Format

Documentation

@@ -441,20 +439,6 @@ default Long generateNullable(InternalRow row, int i) {

private static class SequenceGeneratorVisitor extends DataTypeDefaultVisitor<Generator> {

@Override
Contributor:

Why remove here?

@hzjhjjyy (Contributor Author) commented May 8, 2024:

The doc doesn't mention support for STRING, so I removed it. Otherwise, using STRING might lead to parsing errors.

> For fields..sequence-group, valid comparative data types include: DECIMAL, TINYINT, SMALLINT, INTEGER, BIGINT, FLOAT, DOUBLE, DATE, TIME, TIMESTAMP, and TIMESTAMP_LTZ.
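The documented restriction could be expressed as a simple allow-list check. This is a hypothetical sketch only; the real check lives in Paimon's Java `SequenceGeneratorVisitor`, and the helper name here is invented:

```python
# Documented comparable types for sequence-group fields (per the quoted docs);
# STRING is deliberately absent, which is why the STRING branch was removed.
ALLOWED_SEQUENCE_TYPES = {
    "DECIMAL", "TINYINT", "SMALLINT", "INTEGER", "BIGINT",
    "FLOAT", "DOUBLE", "DATE", "TIME", "TIMESTAMP", "TIMESTAMP_LTZ",
}

def validate_sequence_field(name, data_type):
    """Reject types (such as STRING) that the docs do not list as comparable."""
    if data_type not in ALLOWED_SEQUENCE_TYPES:
        raise ValueError(
            f"Field '{name}' of type {data_type} cannot be used as a "
            "sequence-group field"
        )
```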

Contributor:

What we need to do is support STRING using the UserDefinedSeqComparator way.

If you are interested in this, you can do it in a separate PR; please keep this PR clean.

Contributor Author:

OK, the code has been restored. I will submit a new PR.

@JingsongLi (Contributor):

Hi @hzjhjjyy, does this come from a user requirement? I understand this optimization, but I'm hesitant to move forward with it.

@hzjhjjyy (Contributor Author):

> Hi @hzjhjjyy, does this come from a user requirement? I understand this optimization, but I'm hesitant to move forward with it.

I think this improves the efficiency and scope of UPDATE. Naturally, I'm also willing to decide how to handle this PR based on your advice.

@JingsongLi (Contributor):

> I think this improves the efficiency and scope of UPDATE. Naturally, I'm also willing to decide how to handle this PR based on your advice.

Hi @hzjhjjyy, for your inputs:

  1. Efficiency of update: in updates, the biggest cost is twofold: first, locating the data in the files; second, rewriting the files or relying on MOR (merge-on-read). The optimization effect of such updates is not significant.
  2. Scope of update: I get that this can support FieldLastValueAgg, but the default is FieldLastNonNullValueAgg.

Considering these two points, the changes this PR makes to the current topology are not very worthwhile.

@hzjhjjyy (Contributor Author) commented May 14, 2024:

Hi @JingsongLi.
My own understanding of this PR:

  1. For Paimon, this PR indeed brings no optimization to the calculation method for partial update. The optimization only reduces the fields retrieved and transmitted when rewriting UPDATE SQL into SELECT on the Flink side. Still, this helps for large tables, since updates typically don't involve many fields at once.
  2. Currently, updates are only provided for deduplicate and partial-update. Given the similarity between partial-update and aggregation in their agg functions, support for aggregation has been added (otherwise, even full-column updates wouldn't be supported). In my description above, I mentioned last_value specifically because of its uniqueness, hence the separate treatment it receives in this PR; last_non_null_value can be implemented without special treatment.

Overall, this PR aims to support using partial columns in UPDATE in Flink. I wonder if my explanation clarifies and captures your intent?
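Point 1 above can be illustrated with a toy rewrite. This is purely illustrative and not Flink's actual planner code; the function name and its parameters are invented for the sketch:

```python
def rewrite_update_as_select(table, set_clause, where, extra_cols):
    """Toy illustration: an UPDATE is rewritten into a SELECT that reads only
    the assigned columns (with their new value expressions) plus the columns
    the merge engine requires, instead of every column of the table.

    set_clause: dict mapping column name -> new-value SQL expression
    extra_cols: merge-engine-required columns (keys, sequence field, ...)
    """
    # assigned columns are projected as their new-value expressions
    projections = [f"{expr} AS {col}" for col, expr in set_clause.items()]
    # required columns not being assigned are passed through unchanged
    projections += sorted(extra_cols - set(set_clause))
    return f"SELECT {', '.join(projections)} FROM {table} WHERE {where}"
```

The point is that for a wide table, the projection list stays proportional to the few updated columns plus the required metadata columns, rather than the full schema.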

@JingsongLi (Contributor):

> Overall, this PR aims to support using partial columns in UPDATE in Flink. I wonder if my explanation clarifies and captures your intent?

Yes, I got your point, but my point is just "Is our modification worth it?" We can wait for these requirements to emerge.
