Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add indexes on dag_id column in referencing tables to speed up deletion of dag records #39526

Closed

Conversation

pankajkoti
Copy link
Member

When the dag records count gets huge, and users try to delete
DAG and DAG runs that are no longer needed or are stale, it
is observed that the deletion is significantly slow. The reason
for this is that the CASCADING DELETES are slow. Although,
we have foreign key constraints in the referencing tables, they
do not create an index implicitly on those columns (dag_id in
the referencing tables in this case). Hence, we're creating indexes
on the 6 referencing table for CASCADE DELETES to speed up
the deletion of records. Without these indexes, it was observed
that it takes many hours to delete those records and it reduced
to a few seconds after adding those indexes.


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@pankajkoti pankajkoti force-pushed the idx-optimise-slow-deletion-of-dags branch from e505efe to e498ea5 Compare May 9, 2024 15:21
@pankajkoti pankajkoti marked this pull request as ready for review May 9, 2024 15:22
@Taragolis
Copy link
Contributor

Seems like for MySQL indexes for FK already exists

@pankajkoti
Copy link
Member Author

pankajkoti commented May 9, 2024

Seems like for MySQL indexes for FK already exists

Wondering if I should put these operations under

if conn.dialect.name != "mysql":
    create_index....

or

if conn.dialect.name in  ("sqlite", "postgresql"):
    create_index....

WDYT is better?

@pankajkoti pankajkoti requested a review from dstandish May 9, 2024 17:32
@kaxil kaxil added this to the Airflow 2.9.2 milestone May 10, 2024
"""Apply Add index on dag_id in referencing tables."""
conn = op.get_bind()
# MySQL already indexes the foreign keys, hence only create indexes for postgres and sqlite.
if conn.dialect.name in ("postgresql", "sqlite"):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opted to include explicit checks for PostgreSQL and SQLite here because we're aware that these database engines don't automatically create indexes on foreign keys. By doing so, if we introduce support for another database engine in the future, whose behaviour is similar to MySQL, which follows the same behaviour of creating indexes by default on foreign keys, this migration won't encounter issues. Otherwise, without these explicit checks, the migration might fail if the condition conn.dialect.name != "mysql" were present.

For any additional databases we decide to support, we can create similar migrations tailored to their behavior regarding index creation on foreign keys.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also need to declare the indexes to match? In SQLAlchemy 2.0 there’s ddl_if that can do the right thing, but I’m not sure if there’s an equivalent in 1.4.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious if you are suggesting this because of the failing test? I do not find such a reference in our previous migrations.

I am guessing the mismatch is because we're adding the Index in the SQLAlchemy models but it's not necessary for MySQL since they are created by default. I tried removing it from the model and having it just in the migrations. With that it succeeds for MySQL, but then complains for PostgreSQL and SQLite about the mismatch between db schema and SQLAlchemy models.

@ephraimbuddy would you have some inputs for me here based on your previous migrations expertise :) ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m mostly suggesting because having matching declaration and migration seems like a reasonable thing to do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One way to get what's in the ORM to match what's in migration file is to use autogenerate instead of writing the migration file yourself. Delete the migration file you created and run the below command to generate the migration file:
airflow db reset --use-migration-files
then cd airflow and run autogenerate:
alembic revision -m "Update indexes" --autogenerate

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an update is here that I created a fresh PR #39638 to apply Ephraim's suggestion and also this branch contains conflicts as a newer migration is merged to main, so easier to resolve those in a newer branch cut from main.

We're trying to work things out in that PR. I will close this PR soon if we can make things work in that PR.

@phanikumv phanikumv changed the title Add indexes on dag_id column in refencing tables to speed up deletion of dag records Add indexes on dag_id column in referencing tables to speed up deletion of dag records May 15, 2024
@pankajkoti pankajkoti marked this pull request as draft May 15, 2024 13:45
@kaxil kaxil modified the milestones: Airflow 2.9.2, Airflow 2.10.0 May 15, 2024
@pankajkoti
Copy link
Member Author

Closing in favour of #39638

@pankajkoti pankajkoti closed this May 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants