[BugFix] retry get spark state from spark client log when yarn queue expire #45694

Open
blanklin030 opened this issue May 15, 2024 · 0 comments · May be fixed by #45695
Labels
type/bug Something isn't working

Comments

blanklin030 (Contributor) commented May 15, 2024

Steps to reproduce the behavior (Required)

Background

    1. After StarRocks submits the Spark task, an independent asynchronous thread (load etl checker) checks the execution status of the task every 5s and synchronously updates the load job status.
    2. The task status is checked by running the following command:
yarn --config configDir application -status appId
    3. YARN maintains two application lists: one for running applications and one for finished applications. To avoid excessive memory usage, the number of retained completed applications is capped at 5000.
    4. This creates a boundary problem: once YARN evicts the appId from the finished list, the asynchronous thread (load etl checker) that checks the appId 5s later can no longer find it (see the sketch after the application details below).
2024-05-16 12:08:33,994 WARN (Load etl checker|33) [SparkEtlJobHandler.getEtlJobStatus():265] yarn application status failed. spark app id: application_1709707920318_33681730, load job id: 2946462, timeout: 30000, return code: 255, stderr: which: no /usr/local/hadoop-current/bin/yarn in ((null))
, stdout: Application with id 'application_1709707920318_33681730' doesn't exist in RM or Timeline Server.

2024-05-16 12:08:33,994 WARN (Load etl checker|33) [LoadManager.lambda$processEtlStateJobs$9():477] update load job etl status failed. job id: 2946462
com.starrocks.common.LoadException: yarn application status failed. error: which: no /usr/local/hadoop-current/bin/yarn in ((null))

        at com.starrocks.load.loadv2.SparkEtlJobHandler.getEtlJobStatus(SparkEtlJobHandler.java:268) ~[starrocks-fe.jar:?]
        at com.starrocks.load.loadv2.SparkLoadJob.updateEtlStatus(SparkLoadJob.java:308) ~[starrocks-fe.jar:?]
        at com.starrocks.load.loadv2.LoadManager.lambda$processEtlStateJobs$9(LoadManager.java:467) ~[starrocks-fe.jar:?]
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) ~[?:1.8.0_402]
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_402]
        at java.util.concurrent.ConcurrentHashMap$ValueSpliterator.forEachRemaining(ConcurrentHashMap.java:3564) ~[?:1.8.0_402]
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) ~[?:1.8.0_402]
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) ~[?:1.8.0_402]
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) ~[?:1.8.0_402]
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) ~[?:1.8.0_402]
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_402]
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485) ~[?:1.8.0_402]
        at com.starrocks.load.loadv2.LoadManager.processEtlStateJobs(LoadManager.java:465) ~[starrocks-fe.jar:?]
        at com.starrocks.load.loadv2.LoadEtlChecker.runAfterCatalogReady(LoadEtlChecker.java:42) ~[starrocks-fe.jar:?]
        at com.starrocks.common.util.LeaderDaemon.runOneCycle(LeaderDaemon.java:60) ~[starrocks-fe.jar:?]
        at com.starrocks.common.util.Daemon.run(Daemon.java:115) ~[starrocks-fe.jar:?]
    5. The Spark task execution details below show that the task had already finished about 4 minutes before YARN was queried for its status; by that time YARN had already cleared the application.
{
  "id": "application_1709707920318_33681730",
  "name": "starrocks_xxxx__e5036056167586cbfd6ab73c73d70fad_0",
  "attempts": [
    {
      "attemptId": "1",
      "startTime": "2024-05-16T04:03:36.016GMT",
      "endTime": "2024-05-16T04:04:53.262GMT",
      "lastUpdated": "2024-05-16T04:04:53.626GMT",
      "duration": 77246,
      "sparkUser": "prod_xxx",
      "completed": true,
      "appSparkVersion": "3.2.0-104",
      "startTimeEpoch": 1715832216016,
      "endTimeEpoch": 1715832293262,
      "lastUpdatedEpoch": 1715832293626
    }
  ]
}
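
For illustration, here is a minimal sketch of how the boundary problem surfaces, assuming the checker shells out to the yarn CLI roughly as shown. Only SparkEtlJobHandler.getEtlJobStatus appears in the stack trace above; the class, method, and parameter names below are hypothetical.

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class YarnStatusProbe {
    // Returns the combined stdout/stderr of `yarn --config <configDir> application -status <appId>`,
    // or throws if the command fails -- which is what happens once YARN has evicted the
    // finished application from its completed-application list (capped at 5000 in this setup).
    public static String getYarnAppStatus(String yarnBin, String configDir, String appId) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(
                yarnBin, "--config", configDir, "application", "-status", appId);
        pb.redirectErrorStream(true);
        Process proc = pb.start();

        StringBuilder output = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(proc.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        }
        int exitCode = proc.waitFor();
        if (exitCode != 0) {
            // Boundary case: YARN replies "Application with id '...' doesn't exist in RM or
            // Timeline Server." with a non-zero return code, which the checker currently
            // surfaces as a LoadException even though the task may have succeeded.
            throw new Exception("yarn application status failed, return code: " + exitCode
                    + ", output: " + output);
        }
        return output.toString();
    }
}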

Expected behavior (Required)

After the Spark task is submitted, the Spark client keeps an interaction log on the client side. As long as that log is persisted to disk, it can be re-parsed when the boundary problem occurs later, so the actual execution status of the task can still be determined.
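
A minimal sketch of this fallback (not the actual change in #45695): if YARN no longer knows the application, re-parse the persisted spark-submit client log and derive the final state from it. The pattern below follows the typical "final status: SUCCEEDED" line printed by Spark's YARN client, but the exact format depends on the Spark version, and the log path is hypothetical.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SparkClientLogFallback {
    // Matches lines such as "final status: SUCCEEDED" in the persisted spark-submit client log.
    private static final Pattern FINAL_STATUS = Pattern.compile("final status:\\s*(\\w+)");

    // Returns the last final status recorded in the client log (e.g. SUCCEEDED / FAILED / UNDEFINED),
    // so the load job state can still be resolved after YARN has forgotten the application.
    public static Optional<String> finalStatusFromClientLog(Path clientLogPath) throws IOException {
        String status = null;
        for (String line : Files.readAllLines(clientLogPath)) {
            Matcher m = FINAL_STATUS.matcher(line);
            if (m.find()) {
                status = m.group(1);
            }
        }
        return Optional.ofNullable(status);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical path; in practice this would be the spark launcher log kept by the FE.
        Path log = Paths.get("/path/to/spark-launcher.log");
        finalStatusFromClientLog(log)
                .ifPresent(s -> System.out.println("final status from client log: " + s));
    }
}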

Real behavior (Required)

Application with id 'application_1690789256527_3299460' doesn't exist in RM or Timeline Server.

StarRocks version (Required)

  • You can get the StarRocks version by executing SQL select current_version()
blanklin030 added the type/bug label May 15, 2024