[BugFix] retry get spark state from spark client log when yarn queue expire #45694

Open
blanklin030 opened this issue May 15, 2024 · 0 comments · May be fixed by #45695
Labels
type/bug Something isn't working

Comments

blanklin030 (Contributor) commented May 15, 2024

Steps to reproduce the behavior (Required)

Background

    1. After StarRocks submits the Spark task, an independent asynchronous thread (load etl checker) checks the execution status of the task every 5s and synchronously updates the load job status.
    2. The task status is checked by running the following command:
yarn --config configDir application -status appId
    3. YARN maintains two application lists: one for running applications and one for finished applications. To avoid excessive memory usage, the number of retained completed applications is capped at 5000.
    4. This creates a boundary problem: once YARN evicts the appId from the finished list, the asynchronous thread (load etl checker) that checks the appId 5s later can no longer find it (see the sketch after the application details below).
2024-05-16 12:08:33,994 WARN (Load etl checker|33) [SparkEtlJobHandler.getEtlJobStatus():265] yarn application status failed. spark app id: application_1709707920318_33681730, load job id: 2946462, timeout: 30000, return code: 255, stderr: which: no /usr/local/hadoop-current/bin/yarn in ((null))
, stdout: Application with id 'application_1709707920318_33681730' doesn't exist in RM or Timeline Server.

2024-05-16 12:08:33,994 WARN (Load etl checker|33) [LoadManager.lambda$processEtlStateJobs$9():477] update load job etl status failed. job id: 2946462
com.starrocks.common.LoadException: yarn application status failed. error: which: no /usr/local/hadoop-current/bin/yarn in ((null))

        at com.starrocks.load.loadv2.SparkEtlJobHandler.getEtlJobStatus(SparkEtlJobHandler.java:268) ~[starrocks-fe.jar:?]
        at com.starrocks.load.loadv2.SparkLoadJob.updateEtlStatus(SparkLoadJob.java:308) ~[starrocks-fe.jar:?]
        at com.starrocks.load.loadv2.LoadManager.lambda$processEtlStateJobs$9(LoadManager.java:467) ~[starrocks-fe.jar:?]
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) ~[?:1.8.0_402]
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_402]
        at java.util.concurrent.ConcurrentHashMap$ValueSpliterator.forEachRemaining(ConcurrentHashMap.java:3564) ~[?:1.8.0_402]
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) ~[?:1.8.0_402]
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) ~[?:1.8.0_402]
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) ~[?:1.8.0_402]
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) ~[?:1.8.0_402]
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_402]
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485) ~[?:1.8.0_402]
        at com.starrocks.load.loadv2.LoadManager.processEtlStateJobs(LoadManager.java:465) ~[starrocks-fe.jar:?]
        at com.starrocks.load.loadv2.LoadEtlChecker.runAfterCatalogReady(LoadEtlChecker.java:42) ~[starrocks-fe.jar:?]
        at com.starrocks.common.util.LeaderDaemon.runOneCycle(LeaderDaemon.java:60) ~[starrocks-fe.jar:?]
        at com.starrocks.common.util.Daemon.run(Daemon.java:115) ~[starrocks-fe.jar:?]
    5. The Spark task execution details below show that the task had already finished about 4 minutes before YARN was queried for its status; by that time YARN had already cleared the application.
{
  "id": "application_1709707920318_33681730",
  "name": "starrocks_xxxx__e5036056167586cbfd6ab73c73d70fad_0",
  "attempts": [
    {
      "attemptId": "1",
      "startTime": "2024-05-16T04:03:36.016GMT",
      "endTime": "2024-05-16T04:04:53.262GMT",
      "lastUpdated": "2024-05-16T04:04:53.626GMT",
      "duration": 77246,
      "sparkUser": "prod_xxx",
      "completed": true,
      "appSparkVersion": "3.2.0-104",
      "startTimeEpoch": 1715832216016,
      "endTimeEpoch": 1715832293262,
      "lastUpdatedEpoch": 1715832293626
    }
  ]
}
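
For illustration, here is a minimal sketch of how the boundary problem surfaces, assuming the checker shells out to the yarn CLI roughly as shown. Only SparkEtlJobHandler.getEtlJobStatus appears in the stack trace above; the class, method, and parameter names below are hypothetical.

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class YarnStatusProbe {
    // Returns the combined stdout/stderr of `yarn --config <configDir> application -status <appId>`,
    // or throws if the command fails -- which is what happens once YARN has evicted the
    // finished application from its completed-application list (capped at 5000 in this setup).
    public static String getYarnAppStatus(String yarnBin, String configDir, String appId) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(
                yarnBin, "--config", configDir, "application", "-status", appId);
        pb.redirectErrorStream(true);
        Process proc = pb.start();

        StringBuilder output = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(proc.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        }
        int exitCode = proc.waitFor();
        if (exitCode != 0) {
            // Boundary case: YARN replies "Application with id '...' doesn't exist in RM or
            // Timeline Server." with a non-zero return code, which the checker currently
            // surfaces as a LoadException even though the task may have succeeded.
            throw new Exception("yarn application status failed, return code: " + exitCode
                    + ", output: " + output);
        }
        return output.toString();
    }
}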

Expected behavior (Required)

After the Spark task is submitted, the Spark client keeps an interaction log on the client side. As long as that log is persisted to disk, it can be re-parsed when the boundary problem occurs later, so the actual execution status of the task can still be determined.
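
A minimal sketch of this fallback (not the actual change in #45695): if YARN no longer knows the application, re-parse the persisted spark-submit client log and derive the final state from it. The pattern below follows the typical "final status: SUCCEEDED" line printed by Spark's YARN client, but the exact format depends on the Spark version, and the log path is hypothetical.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SparkClientLogFallback {
    // Matches lines such as "final status: SUCCEEDED" in the persisted spark-submit client log.
    private static final Pattern FINAL_STATUS = Pattern.compile("final status:\\s*(\\w+)");

    // Returns the last final status recorded in the client log (e.g. SUCCEEDED / FAILED / UNDEFINED),
    // so the load job state can still be resolved after YARN has forgotten the application.
    public static Optional<String> finalStatusFromClientLog(Path clientLogPath) throws IOException {
        String status = null;
        for (String line : Files.readAllLines(clientLogPath)) {
            Matcher m = FINAL_STATUS.matcher(line);
            if (m.find()) {
                status = m.group(1);
            }
        }
        return Optional.ofNullable(status);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical path; in practice this would be the spark launcher log kept by the FE.
        Path log = Paths.get("/path/to/spark-launcher.log");
        finalStatusFromClientLog(log)
                .ifPresent(s -> System.out.println("final status from client log: " + s));
    }
}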

Real behavior (Required)

Application with id 'application_1690789256527_3299460' doesn't exist in RM or Timeline Server.

StarRocks version (Required)

  • You can get the StarRocks version by executing SQL select current_version()
blanklin030 added the type/bug label May 15, 2024