Update standard_task_runner.py #39543

Open · wants to merge 1 commit into main

Conversation

@RNHTTR (Collaborator) commented May 10, 2024

This can be caused by any number of things. Suggesting OOM can send a user down a troubleshooting path that is not helpful.

@boring-cyborg (bot) added the area:Scheduler (Scheduler or dag parsing Issues) label on May 10, 2024
@Taragolis (Contributor) commented:

> This can be caused by any number of things.

Maybe instead of removing it, describe what kinds of things can cause this and how they could be analysed.

@potiuk (Member) commented May 10, 2024

Yep. I think this is an opportunity to send the user down a GOOD path (or at least explain the various reasons and possibly point the user to some troubleshooting documentation). Otherwise this new message gives absolutely no clue whatsoever, which means users will come and open an issue because they have no idea what to do. The OOM hint at least gave the user a direction to start looking at possible reasons, and maybe they could find something.

@RNHTTR (Collaborator, Author) commented May 16, 2024

> Yep. I think this is an opportunity to send the user down a GOOD path (or at least explain the various reasons and possibly point the user to some troubleshooting documentation). Otherwise this new message gives absolutely no clue whatsoever, which means users will come and open an issue because they have no idea what to do. The OOM hint at least gave the user a direction to start looking at possible reasons, and maybe they could find something.

I don't have a good recommendation. A more likely path for OOM (at least in my experience) is the task becoming a zombie rather than Airflow failing the task in a more typical manner. I've seen this error message somewhat frequently, but I don't know that I've ever seen it as a result of OOM.

@RNHTTR (Collaborator, Author) commented May 16, 2024

@amoghrajesh had a good response to a user in Slack:

> -9 is a SIGKILL.
> You can handle it using the on_kill callback in your Operator.

Perhaps the message could recommend that users implement on_kill in their operator?
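
For anyone landing here from that error: the return code itself already identifies the signal, and a quick check with the standard library confirms the mapping (a generic Python snippet, not Airflow code):

    import signal
    import subprocess

    # Reproduce a "-9" exit: start a child process and send it SIGKILL.
    proc = subprocess.Popen(["sleep", "60"])
    proc.kill()   # on POSIX this sends SIGKILL
    proc.wait()

    # Popen.returncode is -N when the child was terminated by signal N,
    # so a return code of -9 maps back to SIGKILL.
    if proc.returncode < 0:
        sig = signal.Signals(-proc.returncode)
        print(f"child terminated by {sig.name} ({proc.returncode})")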

@amoghrajesh (Contributor) commented:

@RNHTTR good recommendation. What if we document that better, or do something even better?
In the on_kill callback of the BaseOperator, we can add enough information to send users down the various debugging paths. TL;DR: add all the possible causes we can think of there, since the on_kill callback will only be called in the case of, well, a kill.

    def on_kill(self):
        self.log.info("SIGKILL was called. It could be because of: a)...b)....")

@potiuk (Member) commented May 19, 2024

> @RNHTTR good recommendation. What if we document that better, or do something even better? In the on_kill callback of the BaseOperator, we can add enough information to send users down the various debugging paths. TL;DR: add all the possible causes we can think of there, since the on_kill callback will only be called in the case of, well, a kill.
>
>     def on_kill(self):
>         self.log.info("SIGKILL was called. It could be because of: a)...b)....")

SIGKILL will never trigger on_kill. The -9 signal is not really possible to handle in "on_kill" - this is why we are guessing here about why the process was killed. The "on_kill" method name really has no relation to SIGKILL (-9) - it is called when the task is stopped more gracefully, rather than by -9.

I think the right approach is to explain more about what happens - the current description is rather vague. State here that the task process was killed externally by -9, and list possible reasons why that might happen. OOM is one reason, but there are others - for example, when a machine/pod is evicted, -9 might be sent to all processes that are not responsive to other attempts to kill them.

It would be great to add a little more description of all that and give the user some direction to look in - usually it is a signal sent by the deployment (K8S), but there might be other reasons. I think the Airflow standard task runner heartbeat might also SIGKILL such a process if it becomes unresponsive (and likely another log is written somewhere in this case) - it would be worth checking. So just listing a few possible reasons here (and making sure the list is open-ended) could be useful. Maybe we should even have a "Why can my task get SIGKILLed?" section in our FAQ with a bit more description there.
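
For reference, this is why no in-process callback (on_kill included) can ever observe a SIGKILL: a process can install handlers for signals like SIGTERM, but the kernel refuses to install one for SIGKILL. A minimal, generic demonstration in plain Python, unrelated to Airflow's APIs:

    import signal

    def handler(signum, frame):
        print(f"caught {signal.Signals(signum).name}")

    # SIGTERM (and most other signals) can be trapped by the process itself...
    signal.signal(signal.SIGTERM, handler)

    # ...but SIGKILL cannot: installing a handler fails, so the process gets
    # no chance to run any cleanup code when it is killed with -9.
    try:
        signal.signal(signal.SIGKILL, handler)
    except (OSError, ValueError) as err:
        print(f"cannot trap SIGKILL: {err}")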

@amoghrajesh (Contributor) commented:

Thanks for the clarification @potiuk
I was under the impression that -9 is caught in on_kill. I agree with your suggestions, and we should mention some of these to give users a debugging direction. It could include things like the following (see the sketch after this list):

  1. The task might have been killed due to OOM.
  2. The task might have been killed due to a bad underlying host.
  3. The task could have been killed by your deployment manager due to x, y reasons.
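
Pulling these suggestions together, the replacement message could stay open-ended while still pointing users at those directions. A rough sketch only - the helper name, wording, and logger call are illustrative, not the actual patch to standard_task_runner.py:

    import signal

    def log_negsignal_hint(log, return_code: int, job_id: str) -> None:
        """Log an open-ended hint when a task process exits because of SIGKILL (-9)."""
        if return_code == -signal.SIGKILL:
            log.error(
                "Task process for job %s received SIGKILL (return code -9). This signal is "
                "sent from outside the process; common causes include the container or pod "
                "running out of memory (OOM killer), node/pod eviction by the deployment "
                "manager (e.g. Kubernetes), or a supervising process force-killing an "
                "unresponsive task. Check dmesg/kernel logs, pod events, and scheduler logs "
                "to narrow it down.",
                job_id,
            )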
