Add indexes on dag_id column in referencing tables to speed up deletion of dag records #39526
Conversation
e505efe to e498ea5
Seems like for MySQL, indexes for the FKs already exist.
Wondering if I should put these operations under
or
WDYT is better?
"""Apply Add index on dag_id in referencing tables."""
conn = op.get_bind()
# MySQL already indexes the foreign keys, hence only create indexes for postgres and sqlite.
if conn.dialect.name in ("postgresql", "sqlite"):
I opted to include explicit checks for PostgreSQL and SQLite here because we know these database engines don't automatically create indexes on foreign keys. That way, if we introduce support for another database engine in the future that, like MySQL, creates indexes on foreign keys by default, this migration won't encounter issues. With a negative condition such as conn.dialect.name != "mysql" instead, the migration could fail on such an engine.
For any additional databases we decide to support, we can create similar migrations tailored to their behaviour regarding index creation on foreign keys.
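The allow-list approach described here can be sketched as a small helper; the helper name and constant are illustrative, not taken from the actual Airflow migration:

```python
# Sketch of the dialect allow-list discussed above (illustrative names).
# Dialects known NOT to index foreign key columns automatically.
DIALECTS_NEEDING_EXPLICIT_FK_INDEXES = frozenset({"postgresql", "sqlite"})


def needs_explicit_fk_index(dialect_name: str) -> bool:
    """Return True if the migration must create the FK indexes itself.

    MySQL (and any future dialect that indexes FKs by default) falls
    through to False, so the migration becomes a no-op there instead of
    failing, which a negative check like `dialect_name != "mysql"`
    would not guarantee.
    """
    return dialect_name in DIALECTS_NEEDING_EXPLICIT_FK_INDEXES


print(needs_explicit_fk_index("postgresql"))  # True
print(needs_explicit_fk_index("mysql"))       # False
```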
Do we also need to declare the indexes to match? In SQLAlchemy 2.0 there’s ddl_if
that can do the right thing, but I’m not sure if there’s an equivalent in 1.4.
Curious if you are suggesting this because of the failing test? I don't find such a reference in our previous migrations.
I am guessing the mismatch is because we're adding the Index to the SQLAlchemy models, but it's not necessary for MySQL since the indexes are created there by default. I tried removing it from the model and keeping it only in the migration. With that it succeeds for MySQL, but then PostgreSQL and SQLite complain about the mismatch between the db schema and the SQLAlchemy models.
@ephraimbuddy would you have some inputs for me here based on your previous migrations expertise :) ?
I’m mostly suggesting because having matching declaration and migration seems like a reasonable thing to do.
One way to get what's in the ORM to match what's in the migration file is to use autogenerate instead of writing the migration file yourself. Delete the migration file you created and run the command below to regenerate it:
airflow db reset --use-migration-files
then cd airflow and run autogenerate:
alembic revision -m "Update indexes" --autogenerate
An update: I created a fresh PR #39638 to apply Ephraim's suggestion. This branch also contains conflicts, since a newer migration was merged to main, so it's easier to resolve those in a newer branch cut from main.
We're trying to work things out in that PR. I will close this PR soon if we can make things work there.
Closing in favour of #39638
When the dag records count gets huge and users try to delete
DAGs and DAG runs that are no longer needed or are stale, the
deletion is observed to be significantly slow. The reason is
that the CASCADE DELETEs are slow: although we have foreign
key constraints in the referencing tables, they do not
implicitly create an index on those columns (dag_id in the
referencing tables in this case). Hence, we're creating indexes
on the six referencing tables so the CASCADE DELETEs speed up
the deletion of records. Without these indexes it was observed
to take many hours to delete those records; after adding them
it takes a few seconds.
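The effect described above can be reproduced in miniature with SQLite (table and index names here are illustrative, not Airflow's actual schema): the FK constraint alone enforces the cascade but does not index the referencing column, so the explicit index is what lets the database find child rows without scanning the whole table on every cascaded delete.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs with this on
conn.execute("CREATE TABLE dag (dag_id TEXT PRIMARY KEY)")
conn.execute(
    """CREATE TABLE task_instance (
           id INTEGER PRIMARY KEY,
           dag_id TEXT REFERENCES dag(dag_id) ON DELETE CASCADE
       )"""
)
# The FK constraint does NOT index task_instance.dag_id; this explicit
# index is what makes the cascaded delete fast at scale.
conn.execute("CREATE INDEX idx_task_instance_dag_id ON task_instance (dag_id)")

conn.execute("INSERT INTO dag VALUES ('example_dag')")
conn.executemany(
    "INSERT INTO task_instance (dag_id) VALUES (?)",
    [("example_dag",)] * 1000,
)
conn.execute("DELETE FROM dag WHERE dag_id = 'example_dag'")
remaining = conn.execute("SELECT COUNT(*) FROM task_instance").fetchone()[0]
print(remaining)  # 0 -- all child rows were cascaded
```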
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.