
[HUDI-7769] Fix Hudi CDC read with legacy parquet file format on Spark #11242

Merged: 1 commit, May 30, 2024

@yihua (Contributor) commented May 16, 2024

Change Logs

The CDC relation expects InternalRow from the base and log files for merging, so spark.sql.parquet.enableVectorizedReader has to be turned off explicitly. Otherwise, a CDC query using the Hudi legacy parquet file format on Spark fails with the following error:

Job aborted due to stage failure: Task 0 in stage 84.0 failed 1 times, most recent failure: Lost task 0.0 in stage 84.0 (TID 122) (fv-az692-999.kaylvc4pbm2utmerkaq2ecni0a.ex.internal.cloudapp.net executor driver): java.lang.AssertionError
	at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLongs(OnHeapColumnVector.java:389)
	at org.apache.spark.sql.vectorized.ColumnarArray.toLongArray(ColumnarArray.java:88)
	at org.apache.spark.sql.vectorized.ColumnarArray.copy(ColumnarArray.java:65)
	at org.apache.spark.sql.vectorized.ColumnarBatchRow.copy(ColumnarBatchRow.java:77)
	at org.apache.hudi.cdc.HoodieCDCRDD$CDCFileGroupIterator.$anonfun$loadCdcFile$1(HoodieCDCRDD.scala:443)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at org.apache.hudi.cdc.HoodieCDCRDD$CDCFileGroupIterator.loadCdcFile(HoodieCDCRDD.scala:441)
	at org.apache.hudi.cdc.HoodieCDCRDD$CDCFileGroupIterator.hasNextInternal(HoodieCDCRDD.scala:250)
	at org.apache.hudi.cdc.HoodieCDCRDD$CDCFileGroupIterator.hasNext(HoodieCDCRDD.scala:278)
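
For illustration only, the setting named above can also be applied at the session level before issuing a CDC query. This is a minimal PySpark sketch, not the change made in this PR: `read_cdc`, `cdc_read_options`, `base_path`, and `begin_instant` are hypothetical names, and whether a session-level override is enough to avoid the error on an unpatched build is an assumption. The option keys are the Hudi datasource options for an incremental query in CDC format as documented around the time of this PR.

```python
# Spark SQL setting named in the PR description.
VECTORIZED_READER_KEY = "spark.sql.parquet.enableVectorizedReader"


def cdc_read_options(begin_instant):
    """Hudi read options for an incremental query in CDC format."""
    return {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.query.incremental.format": "cdc",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }


def read_cdc(spark, base_path, begin_instant):
    """Turn off the vectorized Parquet reader for the session, then run a CDC read.

    `spark` is an existing SparkSession; `base_path` and `begin_instant`
    are placeholders supplied by the caller (assumptions of this sketch).
    """
    # Assumption: a session-level override; the PR applies the setting
    # inside Hudi's CDC read path instead.
    spark.conf.set(VECTORIZED_READER_KEY, "false")
    return (
        spark.read.format("hudi")
        .options(**cdc_read_options(begin_instant))
        .load(base_path)
    )
```

With a live session, `read_cdc(spark, base_path, begin_instant)` returns the CDC DataFrame. The override stays in effect for later queries in the same session, so a caller who wants vectorized reads elsewhere would need to restore the previous value afterwards.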

Impact

Fixes CDC reads with the Hudi legacy parquet file format on newer Spark versions (the PR title originally named Spark 3.3.4 and 3.4.3).

Risk level

low

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions bot added the size:XS (PR with lines of changes in <= 10) label on May 16, 2024
@yihua added a commit to yihua/hudi that referenced this pull request on May 16, 2024
@yihua changed the title from "[HUDI-7769] Fix Hudi CDC read on Spark 3.3.4 and 3.4.3" to "[HUDI-7769] Fix Hudi CDC read with legacy parquet file format on Spark" on May 17, 2024
@apache deleted a comment from hudi-bot on May 18, 2024

@yihua merged commit c758508 into apache:master on May 30, 2024
48 checks passed