
[Flink] Update using partial columns. #3299

Status: Open — wants to merge 3 commits into base: master

Conversation

@hzjhjjyy (Contributor) commented May 6, 2024

Purpose

FLINK-32001 has been resolved, so partial columns can now be used for updates.

Different merge engines have different requirements (support for updating aggregation tables has also been added):

  1. deduplicate: only full columns can be used.
  2. partial-update: in addition to the updated columns, the appended columns include primary keys, partition keys, the sequence field, columns that define last_value in aggregation, and the sequence-group columns of the updated columns.
  3. aggregation: compared to partial-update, it omits the sequence-group columns of the updated columns.
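The rules above can be sketched as a small helper. This is a hypothetical illustration, not Paimon's actual implementation; the function and schema-field names (`required_columns`, `sequence_groups`, etc.) are made up for the sketch:

```python
def required_columns(merge_engine, update_cols, schema):
    """Sketch: which columns an UPDATE must carry, per merge engine.

    merge_engine: "deduplicate" | "partial-update" | "aggregation"
    update_cols:  columns named in the UPDATE's SET clause
    schema:       dict with keys "all", "primary", "partition",
                  "sequence_field", "last_value_agg", "sequence_groups"
                  (sequence_groups maps a column to its sequence-group columns)
    """
    if merge_engine == "deduplicate":
        # deduplicate: only full-column updates are allowed
        return list(schema["all"])

    cols = set(update_cols)
    # both partial-update and aggregation append keys and sequence metadata
    cols |= set(schema["primary"])
    cols |= set(schema["partition"])
    cols |= set(schema["sequence_field"])
    cols |= set(schema["last_value_agg"])  # columns aggregated with last_value
    if merge_engine == "partial-update":
        # partial-update additionally needs the sequence-group columns
        # of each updated column
        for c in update_cols:
            cols |= set(schema["sequence_groups"].get(c, ()))
    return sorted(cols)
```

For example, updating a single column of a partial-update table would also pull in the primary key, partition key, sequence field, and that column's sequence-group, while an aggregation table would skip the sequence-group.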

Tests

BatchUpdateWithPartialColumnsITCase (covers all aggregations, the sequence field/group, and changelog)

API and Format

Documentation

@@ -441,20 +439,6 @@ default Long generateNullable(InternalRow row, int i) {

private static class SequenceGeneratorVisitor extends DataTypeDefaultVisitor<Generator> {

@Override
Contributor:

Why remove here?

@hzjhjjyy (Contributor Author) commented May 8, 2024:

The doc doesn't mention support for STRING, so I removed it. Otherwise, using STRING might lead to parsing errors.

> For fields..sequence-group, valid comparative data types include: DECIMAL, TINYINT, SMALLINT, INTEGER, BIGINT, FLOAT, DOUBLE, DATE, TIME, TIMESTAMP, and TIMESTAMP_LTZ.
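The documented restriction could be expressed as a simple allow-list check. This is a hypothetical sketch only; the real check lives in Paimon's Java `SequenceGeneratorVisitor`, and the helper name here is invented:

```python
# Documented comparable types for sequence-group fields (per the quoted docs);
# STRING is deliberately absent, which is why the STRING branch was removed.
ALLOWED_SEQUENCE_TYPES = {
    "DECIMAL", "TINYINT", "SMALLINT", "INTEGER", "BIGINT",
    "FLOAT", "DOUBLE", "DATE", "TIME", "TIMESTAMP", "TIMESTAMP_LTZ",
}

def validate_sequence_field(name, data_type):
    """Reject types (such as STRING) that the docs do not list as comparable."""
    if data_type not in ALLOWED_SEQUENCE_TYPES:
        raise ValueError(
            f"Field '{name}' of type {data_type} cannot be used as a "
            "sequence-group field"
        )
```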

Contributor:

What we need to do is support STRING using the UserDefinedSeqComparator way.

If you are interested in this, you can do it in a separate PR; please keep this PR clean.

Contributor Author:

OK, the code has been restored. I will submit a new PR.

@JingsongLi (Contributor):

Hi @hzjhjjyy, does this come from a user requirement? I understand this optimization, but I'm hesitant to move forward with it.

@hzjhjjyy (Contributor Author):

> Hi @hzjhjjyy, does this come from a user requirement? I understand this optimization, but I'm hesitant to move forward with it.

I think this improves the efficiency and scope of UPDATE. Naturally, I'm also willing to decide how to handle this PR based on your advice.

@JingsongLi (Contributor):

> I think this improves the efficiency and scope of UPDATE. Naturally, I'm also willing to decide how to handle this PR based on your advice.

Hi @hzjhjjyy, for your inputs:

  1. Efficiency of update: in updates, the biggest cost is twofold: first, locating the data in the files; second, rewriting the files or relying on MOR (merge-on-read). The optimization effect of such updates is not significant.
  2. Scope of update: I get that this can support FieldLastValueAgg, but the default is FieldLastNonNullValueAgg.

Considering these two points, the changes this PR makes to the current topology are not very worthwhile.

@hzjhjjyy (Contributor Author) commented May 14, 2024:

Hi @JingsongLi.
My own understanding of this PR:

  1. For Paimon, this PR indeed brings no optimization to the calculation method for partial update. The optimization only reduces the fields retrieved and transmitted when rewriting UPDATE SQL into SELECT on the Flink side. Still, this helps for large tables, since updates typically don't involve many fields at once.
  2. Currently, updates are only provided for deduplicate and partial-update. Given the similarity between partial-update and aggregation in their agg functions, support for aggregation has been added (otherwise, even full-column updates wouldn't be supported). In my description above, I mentioned last_value specifically because of its uniqueness, hence the separate treatment it receives in this PR; last_non_null_value can be implemented without special treatment.

Overall, this PR aims to support using partial columns in UPDATE in Flink. I wonder if my explanation clarifies and captures your intent?
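Point 1 above can be illustrated with a toy rewrite. This is purely illustrative and not Flink's actual planner code; the function name and its parameters are invented for the sketch:

```python
def rewrite_update_as_select(table, set_clause, where, extra_cols):
    """Toy illustration: an UPDATE is rewritten into a SELECT that reads only
    the assigned columns (with their new value expressions) plus the columns
    the merge engine requires, instead of every column of the table.

    set_clause: dict mapping column name -> new-value SQL expression
    extra_cols: merge-engine-required columns (keys, sequence field, ...)
    """
    # assigned columns are projected as their new-value expressions
    projections = [f"{expr} AS {col}" for col, expr in set_clause.items()]
    # required columns not being assigned are passed through unchanged
    projections += sorted(extra_cols - set(set_clause))
    return f"SELECT {', '.join(projections)} FROM {table} WHERE {where}"
```

The point is that for a wide table, the projection list stays proportional to the few updated columns plus the required metadata columns, rather than the full schema.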

@JingsongLi (Contributor):

> Overall, this PR aims to support using partial columns in UPDATE in Flink. I wonder if my explanation clarifies and captures your intent?

Yes, I got your point, but my point is just "Is our modification worth it?" We can wait for these requirements to emerge.
