Update standard_task_runner.py #39543

Open · wants to merge 1 commit into main

Conversation

@RNHTTR (Collaborator) commented May 10, 2024

This can be caused by any number of things. Suggesting OOM can send a user down a troubleshooting path that is not helpful.

@boring-cyborg (bot) added the area:Scheduler (Scheduler or dag parsing Issues) label on May 10, 2024
@Taragolis (Contributor) commented:

> This can be caused by any number of things.

Maybe instead of removing it, describe what kinds of things can cause this and how they could be analysed.

@potiuk (Member) commented May 10, 2024

Yep. I think this is an opportunity to send the user down a GOOD path (or at least explain the various reasons and possibly point the user to some troubleshooting documentation). Otherwise this new message gives absolutely no clue whatsoever, which means users will come and open an issue because they have no idea what to do. The OOM hint at least gave the user a direction to start looking at possible reasons, and maybe they could find something.

@RNHTTR (Collaborator, Author) commented May 16, 2024

> Yep. I think this is an opportunity to send the user down a GOOD path (or at least explain the various reasons and possibly point the user to some troubleshooting documentation). Otherwise this new message gives absolutely no clue whatsoever, which means users will come and open an issue because they have no idea what to do. The OOM hint at least gave the user a direction to start looking at possible reasons, and maybe they could find something.

I don't have a good recommendation. A more likely path for OOM (at least in my experience) is the task becoming a zombie rather than Airflow failing the task in a more typical manner. I've seen this error message somewhat frequently, but I don't know that I've ever seen it as a result of OOM.

@RNHTTR (Collaborator, Author) commented May 16, 2024

@amoghrajesh had a good response to a user in Slack:

> -9 is a SIGKILL.
> You can handle it using the on_kill callback in your Operator.

Perhaps the message could recommend that users implement on_kill in their operator?
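
For anyone landing here from that error: the return code itself already identifies the signal, and a quick check with the standard library confirms the mapping (a generic Python snippet, not Airflow code):

    import signal
    import subprocess

    # Reproduce a "-9" exit: start a child process and send it SIGKILL.
    proc = subprocess.Popen(["sleep", "60"])
    proc.kill()   # on POSIX this sends SIGKILL
    proc.wait()

    # Popen.returncode is -N when the child was terminated by signal N,
    # so a return code of -9 maps back to SIGKILL.
    if proc.returncode < 0:
        sig = signal.Signals(-proc.returncode)
        print(f"child terminated by {sig.name} ({proc.returncode})")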

@amoghrajesh (Contributor) commented:

@RNHTTR good recommendation. What if we document that better, or do something even better?
In the on_kill callback of the BaseOperator, we can add enough information to send users down the various debugging paths. TL;DR: add all the possible causes we can think of there, since the on_kill callback will only be called in the case of, well, a kill.

    def on_kill(self):
        self.log.info("SIGKILL was called. It could be because of: a)...b)....")

@potiuk (Member) commented May 19, 2024

> @RNHTTR good recommendation. What if we document that better, or do something even better? In the on_kill callback of the BaseOperator, we can add enough information to send users down the various debugging paths. TL;DR: add all the possible causes we can think of there, since the on_kill callback will only be called in the case of, well, a kill.
>
>     def on_kill(self):
>         self.log.info("SIGKILL was called. It could be because of: a)...b)....")

SIGKILL will never trigger on_kill. The -9 signal is not really possible to handle in "on_kill" - this is why we are guessing here about why the process was killed. The "on_kill" method name really has no relation to SIGKILL (-9) - it is called when the task is stopped more gracefully, rather than by -9.

I think the right approach is to explain more about what happens - the current description is rather vague. State here that the task process was killed externally by -9, and list possible reasons why that might happen. OOM is one reason, but there are others - for example, when a machine/pod is evicted, -9 might be sent to all processes that are not responsive to other attempts to kill them.

It would be great to add a little more description of all that and give the user some direction to look in - usually it is a signal sent by the deployment (K8S), but there might be other reasons. I think the Airflow standard task runner heartbeat might also SIGKILL such a process if it becomes unresponsive (and likely another log is written somewhere in this case) - it would be worth checking. So just listing a few possible reasons here (and making sure the list is open-ended) could be useful. Maybe we should even have a "Why can my task get SIGKILLed?" section in our FAQ with a bit more description there.
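
For reference, this is why no in-process callback (on_kill included) can ever observe a SIGKILL: a process can install handlers for signals like SIGTERM, but the kernel refuses to install one for SIGKILL. A minimal, generic demonstration in plain Python, unrelated to Airflow's APIs:

    import signal

    def handler(signum, frame):
        print(f"caught {signal.Signals(signum).name}")

    # SIGTERM (and most other signals) can be trapped by the process itself...
    signal.signal(signal.SIGTERM, handler)

    # ...but SIGKILL cannot: installing a handler fails, so the process gets
    # no chance to run any cleanup code when it is killed with -9.
    try:
        signal.signal(signal.SIGKILL, handler)
    except (OSError, ValueError) as err:
        print(f"cannot trap SIGKILL: {err}")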

@amoghrajesh (Contributor) commented:

Thanks for the clarification @potiuk
I was under the impression that -9 is caught in on_kill. I agree with your suggestions, and we should mention some of these to give users a debugging direction. It could include things like the following (see the sketch after this list):

  1. The task might have been killed due to OOM.
  2. The task might have been killed due to a bad underlying host.
  3. The task could have been killed by your deployment manager due to x, y reasons.
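
Pulling these suggestions together, the replacement message could stay open-ended while still pointing users at those directions. A rough sketch only - the helper name, wording, and logger call are illustrative, not the actual patch to standard_task_runner.py:

    import signal

    def log_negsignal_hint(log, return_code: int, job_id: str) -> None:
        """Log an open-ended hint when a task process exits because of SIGKILL (-9)."""
        if return_code == -signal.SIGKILL:
            log.error(
                "Task process for job %s received SIGKILL (return code -9). This signal is "
                "sent from outside the process; common causes include the container or pod "
                "running out of memory (OOM killer), node/pod eviction by the deployment "
                "manager (e.g. Kubernetes), or a supervising process force-killing an "
                "unresponsive task. Check dmesg/kernel logs, pod events, and scheduler logs "
                "to narrow it down.",
                job_id,
            )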
