[SUPPORT] Run Merge On Read Compactions #11249

Open
jai20242 opened this issue May 16, 2024 · 6 comments
Labels
priority:minor (everything else; usability gaps; questions; feature reqs), table-service

Comments

@jai20242 commented May 16, 2024

Hi.
I have ingested data using the following configuration:

option(OPERATION_OPT_KEY, "upsert").
option(CDC_ENABLED.key(), "true").
option(TABLE_NAME, tableName).
option("hoodie.datasource.write.payload.class","CustomOverwriteWithLatestAvroPayload").
option("hoodie.avro.schema.validate","false").
option("hoodie.datasource.write.recordkey.field",keysTable.mkString(",")).
option("hoodie.datasource.write.precombine.field",COLUMN_TO_SORT).
option("hoodie.datasource.write.new.columns.nullable", "true").
option("hoodie.datasource.write.reconcile.schema","true").
option("hoodie.metadata.enable","false").
option("hoodie.index.type","SIMPLE").
option("hoodie.datasource.write.table.type","MERGE_ON_READ").
option("hoodie.compact.inline","false").
option("hoodie.datasource.write.partitionpath.field","bdp_partition").
option("hoodie.compact.inline.max.delta.commits","1").
mode(Append).
save(dataPath)

But I can't run the compaction; it doesn't work. (In the configuration above you can see hoodie.compact.inline.max.delta.commits = 1, but even if I remove that option and execute two commits, the same thing happens.)
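(For reference: as I understand it, with hoodie.compact.inline set to "false" the writer never compacts on its own, so compaction has to be scheduled and run separately, e.g. from the CLI. A trimmed sketch of the inline alternative, keeping only the compaction-relevant options and reusing df, tableName and dataPath from the snippet above:)

import org.apache.spark.sql.SaveMode

// Trimmed sketch, not the full job above; df, tableName and dataPath are assumed
// to be the same values used in the configuration shown earlier.
df.write.format("hudi").
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.table.name", tableName).
  option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
  option("hoodie.datasource.write.partitionpath.field", "bdp_partition").
  // inline compaction: the writer itself compacts once the delta-commit threshold is reached
  option("hoodie.compact.inline", "true").
  option("hoodie.compact.inline.max.delta.commits", "1").
  mode(SaveMode.Append).
  save(dataPath)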

The .hoodie folder contains the following files:

. .schema
.. .temp
.20240516143453846.deltacommit.crc 20240516143453846.deltacommit
.20240516143453846.deltacommit.inflight.crc 20240516143453846.deltacommit.inflight
.20240516143453846.deltacommit.requested.crc 20240516143453846.deltacommit.requested
.20240516144403250.deltacommit.crc 20240516144403250.deltacommit
.20240516144403250.deltacommit.inflight.crc 20240516144403250.deltacommit.inflight
.20240516144403250.deltacommit.requested.crc 20240516144403250.deltacommit.requested
.20240516154539132.deltacommit.crc 20240516154539132.deltacommit
.20240516154539132.deltacommit.inflight.crc 20240516154539132.deltacommit.inflight
.20240516154539132.deltacommit.requested.crc 20240516154539132.deltacommit.requested
.aux archived
.hoodie.properties.crc hoodie.properties

And a partition:

.
..
..47546248-a9d6-4a99-9c56-6dc1b0c9ad82-0_20240516143453846.log.1_1-26-114.crc
..47546248-a9d6-4a99-9c56-6dc1b0c9ad82-0_20240516143453846.log.2_1-60-262.crc
..7504f0fe-c40f-4bfa-88c0-bf905840f04b-0_20240516143453846.log.1_0-26-113.crc
..7504f0fe-c40f-4bfa-88c0-bf905840f04b-0_20240516143453846.log.2_0-60-261.crc
..hoodie_partition_metadata.crc
.47546248-a9d6-4a99-9c56-6dc1b0c9ad82-0_0-26-105_20240516143453846.parquet.crc
.47546248-a9d6-4a99-9c56-6dc1b0c9ad82-0_20240516143453846.log.1_1-26-114
.47546248-a9d6-4a99-9c56-6dc1b0c9ad82-0_20240516143453846.log.2_1-60-262
.7504f0fe-c40f-4bfa-88c0-bf905840f04b-0_1-26-106_20240516143453846.parquet.crc
.7504f0fe-c40f-4bfa-88c0-bf905840f04b-0_20240516143453846.log.1_0-26-113
.7504f0fe-c40f-4bfa-88c0-bf905840f04b-0_20240516143453846.log.2_0-60-261
.hoodie_partition_metadata
47546248-a9d6-4a99-9c56-6dc1b0c9ad82-0_0-26-105_20240516143453846.parquet
7504f0fe-c40f-4bfa-88c0-bf905840f04b-0_1-26-106_20240516143453846.parquet

Finally, I am trying to compact using the CLI.

I can see two commits:

hudi:prueba->commits show --sortBy "Total Bytes Written" --desc true --limit 10
╔═══════════════════╤═════════════════════╤═══════════════════╤═════════════════════╤══════════════════════════╤═══════════════════════╤══════════════════════════════╤══════════════╗
║ CommitTime │ Total Bytes Written │ Total Files Added │ Total Files Updated │ Total Partitions Written │ Total Records Written │ Total Update Records Written │ Total Errors ║
╠═══════════════════╪═════════════════════╪═══════════════════╪═════════════════════╪══════════════════════════╪═══════════════════════╪══════════════════════════════╪══════════════╣
║ 20240516144403250 │ 752,5 MB │ 0 │ 14 │ 7 │ 1435323 │ 1435323 │ 0 ║
╟───────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
║ 20240516143453846 │ 41,7 MB │ 14 │ 0 │ 7 │ 1435323 │ 0 │ 0 ║
╚═══════════════════╧═════════════════════╧═══════════════════╧═════════════════════╧══════════════════════════╧═══════════════════════╧══════════════════════════════╧══════════════╝

But there are no compactions:

compactions show all
╔═════════════════════════╤═══════╤═══════════════════════════════╗
║ Compaction Instant Time │ State │ Total FileIds to be Compacted ║
╠═════════════════════════╧═══════╧═══════════════════════════════╣
║ (empty) ║
╚═════════════════════════════════════════════════════════════════╝

But the compaction run command returns the following message (after executing compaction schedule):

hudi:prueba->compaction run --tableName prueba
2024-05-16 14:17:08.633 INFO 58141 --- [ main] o.a.h.c.t.t.HoodieActiveTimeline : Loaded instants upto : Option{val=[20240516134708181__deltacommit__COMPLETED__20240516135028000]}
NO PENDING COMPACTION TO RUN

@ad1happy2go (Contributor)

@jai20242 try compaction schedule first.

codope added the priority:minor and table-service labels on May 16, 2024
@jai20242 (Author) commented May 17, 2024

I tried it but it didn't work:

1) Connect
hudi->connect --path /tmp/dep_hudi2
2024-05-17 08:34:09.024 INFO 3288 --- [ main] o.a.h.c.t.HoodieTableMetaClient : Loading HoodieTableMetaClient from /tmp/dep_hudi2
2024-05-17 08:34:09.212 WARN 3288 --- [ main] o.a.h.u.NativeCodeLoader : Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2024-05-17 08:34:09.626 WARN 3288 --- [ main] o.a.h.f.FileSystem : Cannot load filesystem: java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem could not be instantiated
2024-05-17 08:34:09.626 WARN 3288 --- [ main] o.a.h.f.FileSystem : java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkState(ZLjava/lang/String;J)V
2024-05-17 08:34:09.938 INFO 3288 --- [ main] o.a.h.c.t.HoodieTableConfig : Loading table properties from /tmp/dep_hudi2/.hoodie/hoodie.properties
2024-05-17 08:34:09.949 INFO 3288 --- [ main] o.a.h.c.t.HoodieTableMetaClient : Finished Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=PARQUET) from /tmp/dep_hudi2
Metadata for table prueba loaded

2) Show commits
hudi:prueba->commits show --sortBy "Total Bytes Written" --desc true --limit 10
2024-05-17 08:34:19.987 INFO 3288 --- [ main] o.a.h.c.t.t.HoodieActiveTimeline : Loaded instants upto : Option{val=[20240517082810322__deltacommit__COMPLETED__20240517083132000]}
╔═══════════════════╤═════════════════════╤═══════════════════╤═════════════════════╤══════════════════════════╤═══════════════════════╤══════════════════════════════╤══════════════╗
║ CommitTime │ Total Bytes Written │ Total Files Added │ Total Files Updated │ Total Partitions Written │ Total Records Written │ Total Update Records Written │ Total Errors ║
╠═══════════════════╪═════════════════════╪═══════════════════╪═════════════════════╪══════════════════════════╪═══════════════════════╪══════════════════════════════╪══════════════╣
║ 20240517082810322 │ 752,5 MB │ 0 │ 14 │ 7 │ 1435323 │ 1435323 │ 0 ║
╟───────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
║ 20240517082347494 │ 41,7 MB │ 14 │ 0 │ 7 │ 1435323 │ 0 │ 0 ║
╚═══════════════════╧═════════════════════╧═══════════════════╧═════════════════════╧══════════════════════════╧═══════════════════════╧══════════════════════════════╧══════════════╝

3) Run compaction
hudi:prueba->compaction run --tableName prueba
2024-05-17 08:34:24.542 INFO 3288 --- [ main] o.a.h.c.t.t.HoodieActiveTimeline : Loaded instants upto : Option{val=[20240517082810322__deltacommit__COMPLETED__20240517083132000]}
NO PENDING COMPACTION TO RUN

4) Schedule compaction
hudi:prueba->compaction schedule
Attempted to schedule compaction for 20240517083427496

5) Run compaction again
hudi:prueba->compaction run --tableName prueba
2024-05-17 08:34:34.152 INFO 3288 --- [ main] o.a.h.c.t.t.HoodieActiveTimeline : Loaded instants upto : Option{val=[20240517082810322__deltacommit__COMPLETED__20240517083132000]}
NO PENDING COMPACTION TO RUN

And the hudi path is:
1) .hoodie folder
.
..
.20240517082347494.deltacommit.crc
.20240517082347494.deltacommit.inflight.crc
.20240517082347494.deltacommit.requested.crc
.20240517082810322.deltacommit.crc
.20240517082810322.deltacommit.inflight.crc
.20240517082810322.deltacommit.requested.crc
.aux
.hoodie.properties.crc
.schema
.temp
20240517082347494.deltacommit
20240517082347494.deltacommit.inflight
20240517082347494.deltacommit.requested
20240517082810322.deltacommit
20240517082810322.deltacommit.inflight
20240517082810322.deltacommit.requested
archived
hoodie.properties

2) The files in a partition
.
..
..280a7ceb-a1ee-4cd7-ba8c-9ec870164ec9-0_20240517082347494.log.1_1-60-254.crc
..ee36904f-023d-46aa-bdb2-947de4be3fd1-0_20240517082347494.log.1_0-60-253.crc
..hoodie_partition_metadata.crc
.280a7ceb-a1ee-4cd7-ba8c-9ec870164ec9-0_0-26-105_20240517082347494.parquet.crc
.280a7ceb-a1ee-4cd7-ba8c-9ec870164ec9-0_20240517082347494.log.1_1-60-254
.ee36904f-023d-46aa-bdb2-947de4be3fd1-0_1-26-106_20240517082347494.parquet.crc
.ee36904f-023d-46aa-bdb2-947de4be3fd1-0_20240517082347494.log.1_0-60-253
.hoodie_partition_metadata
280a7ceb-a1ee-4cd7-ba8c-9ec870164ec9-0_0-26-105_20240517082347494.parquet
ee36904f-023d-46aa-bdb2-947de4be3fd1-0_1-26-106_20240517082347494.parquet

@ad1happy2go (Contributor)

@jai20242 If you have only 2 delta commits then there will be nothing to compact, since the default for [hoodie.compact.inline.max.delta.commits](https://hudi.apache.org/docs/configurations/#hoodiecompactinlinemaxdeltacommits) is 5. Set this config to 1 if you want to compact after every delta commit.

@jai20242 (Author)

I set the param hoodie.compact.inline.max.delta.commits to 1 (you can see it in the first comment).

@ad1happy2go (Contributor)

@jai20242 That is a writer configuration; Hudi doesn't persist it. When you run compaction from the CLI, you need to pass the config there too.
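
For reference, a minimal sketch of passing the config on the CLI side (untested; the flag syntax matches the commands used later in this thread, and the config is passed to both schedule and run):

compaction schedule --hoodieConfigs "hoodie.compact.inline.max.delta.commits=1"
compactions show all
compaction run --tableName prueba --hoodieConfigs "hoodie.compact.inline.max.delta.commits=1"

After the schedule step, compactions show all should list the new instant in REQUESTED state; if it stays empty, nothing was actually scheduled.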

@jai20242 (Author)

I tried adding the configuration to both compaction schedule and compaction run, but it didn't work:

hudi->connect --path /tmp/dep_hudi2
2024-05-17 13:25:30.737 INFO 21882 --- [ main] o.a.h.c.t.HoodieTableMetaClient : Loading HoodieTableMetaClient from /tmp/dep_hudi2
2024-05-17 13:25:30.906 WARN 21882 --- [ main] o.a.h.u.NativeCodeLoader : Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2024-05-17 13:25:31.243 WARN 21882 --- [ main] o.a.h.f.FileSystem : Cannot load filesystem: java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem could not be instantiated
2024-05-17 13:25:31.243 WARN 21882 --- [ main] o.a.h.f.FileSystem : java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkState(ZLjava/lang/String;J)V
2024-05-17 13:25:31.442 INFO 21882 --- [ main] o.a.h.c.t.HoodieTableConfig : Loading table properties from /tmp/dep_hudi2/.hoodie/hoodie.properties
2024-05-17 13:25:31.457 INFO 21882 --- [ main] o.a.h.c.t.HoodieTableMetaClient : Finished Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=PARQUET) from /tmp/dep_hudi2
Metadata for table prueba loaded
hudi:prueba->compaction schedule --hoodieConfigs "hoodie.compact.inline.max.delta.commits=1"
Attempted to schedule compaction for 20240517132533480
hudi:prueba->compaction run --tableName prueba
2024-05-17 13:25:40.853 INFO 21882 --- [ main] o.a.h.c.t.t.HoodieActiveTimeline : Loaded instants upto : Option{val=[20240517082810322__deltacommit__COMPLETED__20240517083132000]}
NO PENDING COMPACTION TO RUN
hudi:prueba->compactions show all
╔═════════════════════════╤═══════╤═══════════════════════════════╗
║ Compaction Instant Time │ State │ Total FileIds to be Compacted ║
╠═════════════════════════╧═══════╧═══════════════════════════════╣
║ (empty) ║
╚═════════════════════════════════════════════════════════════════╝

hudi:prueba->compaction run --tableName prueba --hoodieConfigs "hoodie.compact.inline.max.delta.commits=1"
2024-05-17 13:26:17.293 INFO 21882 --- [ main] o.a.h.c.t.t.HoodieActiveTimeline : Loaded instants upto : Option{val=[20240517082810322__deltacommit__COMPLETED__20240517083132000]}
NO PENDING COMPACTION TO RUN
hudi:prueba->compaction run --tableName prueba --hoodieConfigs "hoodie.compact.inline.max.delta.commits=1"
2024-05-17 13:26:30.318 INFO 21882 --- [ main] o.a.h.c.t.t.HoodieActiveTimeline : Loaded instants upto : Option{val=[20240517082810322__deltacommit__COMPLETED__20240517083132000]}
NO PENDING COMPACTION TO RUN
hudi:prueba->compactions show all
╔═════════════════════════╤═══════╤═══════════════════════════════╗
║ Compaction Instant Time │ State │ Total FileIds to be Compacted ║
╠═════════════════════════╧═══════╧═══════════════════════════════╣
║ (empty) ║
╚═════════════════════════════════════════════════════════════════╝
