You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I would like to add a new facet that would allow re-creation of DAG graph from OL events. It could look like in my example, but if anybody has any ideas on how to improve it, i would love to hear it.
# Example task dependency definition (minus the dots, it would raise the error but it's more readable)# tg1.task.id.1 >> task_3# tg1.task.id.1 >> task_4 >> task_5# task.id.with.dots.2# task_6 >> [task_7, task_8] >> task_9task_tree= {
"tg1.task.id.1": {
"task_3": {},
"task_4": {
"task_5": {}
}
},
"task.id.with.dots.2": {},
"task_6": {
"task_7": {
"task_9": {} # yes, it would be duplicated becaues of [] dependency
},
"task_8": {
"task_9": {} # yes, it would be duplicated becaues of [] dependency
},
}
}
Use case/motivation
Some OL consumers are trying to re-create DAG graph or simply deduce task dependencies from OL events. Currently it mainly works based on upstream/downstream task ids attributes delivered in AirflowRunFacet for each task. This approach has its downsides when using EmptyOperators (that may not be executed and thus won't deliver events) or Operators not included in OpenLineage (disabled using disabled-for-operators). The above make it impossible to create the dependency graph correctly.
To make it easier for the consumers, I think we should provide all the necessary information to re-create the DAG graph in the DAG START event, and for that I would like to propose adding a new facet.
We could also add the information for each task about the selective lineage being on/off for the task, Operators being disabled / enabled with config. This way, the OL consumer would know exactly what events to expect.
Also, for a dag complete/fail events we could provide them with the state for each task in a dagrun, then the consumer can compare that information with the events received and see why some events are missing f.e. tasks are skipped as there was a branching operator.
Description
I would like to add a new facet that would allow re-creation of DAG graph from OL events. It could look like in my example, but if anybody has any ideas on how to improve it, i would love to hear it.
The facet could look like:
where
taskGroups
could look like:tasks
could look like:and the
taskTree
could look like:Use case/motivation
Some OL consumers are trying to re-create DAG graph or simply deduce task dependencies from OL events. Currently it mainly works based on upstream/downstream task ids attributes delivered in AirflowRunFacet for each task. This approach has its downsides when using EmptyOperators (that may not be executed and thus won't deliver events) or Operators not included in OpenLineage (disabled using disabled-for-operators). The above make it impossible to create the dependency graph correctly.
To make it easier for the consumers, I think we should provide all the necessary information to re-create the DAG graph in the DAG START event, and for that I would like to propose adding a new facet.
Related issues
No response
Are you willing to submit a PR?
Code of Conduct
The text was updated successfully, but these errors were encountered: