Add indexes on dag_id column in referencing tables to speed up deletion of dag records #39526

pankajkoti · 2024-05-09T15:20:39Z

When the number of DAG records grows large and users try to delete
DAGs and DAG runs that are stale or no longer needed, the deletion
is significantly slow. The cause is slow cascading deletes. Although
the referencing tables have foreign key constraints, those constraints
do not implicitly create an index on the constrained columns (dag_id in
the referencing tables in this case). This PR therefore creates indexes
on the 6 referencing tables to speed up the cascading deletes. Without
these indexes, deleting those records was observed to take many hours;
with them, it dropped to a few seconds.
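To illustrate why the missing index matters, here is a minimal sketch using Python's built-in sqlite3 module. The table names (dag, child_of_dag) and the index name are hypothetical stand-ins, not the tables touched by this PR. It compares the query plan for finding the child rows of one DAG, which is the lookup a cascading delete has to perform, before and after an index is added on the foreign key column.

```python
import sqlite3

# Hypothetical schema: "dag" is the parent table and "child_of_dag" stands in
# for one referencing table (names are assumptions for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript(
    """
    CREATE TABLE dag (dag_id TEXT PRIMARY KEY);
    CREATE TABLE child_of_dag (
        id INTEGER PRIMARY KEY,
        dag_id TEXT NOT NULL REFERENCES dag (dag_id) ON DELETE CASCADE
    );
    """
)


def lookup_plan(connection):
    """Return SQLite's query plan for finding the child rows of one DAG."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN SELECT id FROM child_of_dag WHERE dag_id = ?",
        ("example_dag",),
    ).fetchall()
    # The last column of each plan row is the human-readable description.
    return " | ".join(row[-1] for row in rows)


# The foreign key alone does not index child_of_dag.dag_id: full table scan.
plan_without_index = lookup_plan(conn)

conn.execute("CREATE INDEX idx_child_of_dag_dag_id ON child_of_dag (dag_id)")
plan_with_index = lookup_plan(conn)

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)

# The cascade itself behaves the same either way; only the lookup cost changes.
conn.execute("INSERT INTO dag VALUES ('example_dag')")
conn.executemany(
    "INSERT INTO child_of_dag (dag_id) VALUES (?)", [("example_dag",)] * 3
)
conn.execute("DELETE FROM dag WHERE dag_id = 'example_dag'")
remaining = conn.execute("SELECT COUNT(*) FROM child_of_dag").fetchone()[0]
print("child rows left after cascading delete:", remaining)
```

The exact plan text varies between SQLite versions, but the plan without the index should report a scan of the child table, while the plan with the index should mention idx_child_of_dag_dag_id.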


Taragolis · 2024-05-09T15:36:14Z

It seems that for MySQL, indexes for FKs already exist.

pankajkoti · 2024-05-09T17:31:26Z

> It seems that for MySQL, indexes for FKs already exist.

I'm wondering whether I should put these operations under

if conn.dialect.name != "mysql":
    create_index....

or

if conn.dialect.name in ("sqlite", "postgresql"):
    create_index....

Which do you think is better?

pankajkoti · 2024-05-13T07:31:47Z

airflow/migrations/versions/0143_2_9_2_add_index_on_dag_id_in_referencing_tables.py

+    """Apply Add index on dag_id in referencing tables."""
+    conn = op.get_bind()
+    # MySQL already indexes the foreign keys, hence only create indexes for postgres and sqlite.
+    if conn.dialect.name in ("postgresql", "sqlite"):


I opted for explicit checks for PostgreSQL and SQLite here because we know that these database engines do not automatically create indexes on foreign keys. This way, if we later add support for another database engine that, like MySQL, creates indexes on foreign keys by default, this migration will skip it and won't run into issues. With a conn.dialect.name != "mysql" condition instead, the migration would try to create the indexes on such an engine and might fail.

For any additional databases we decide to support, we can create similar migrations tailored to their behavior regarding index creation on foreign keys.
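The difference between the two conditions can be sketched in plain Python, without Alembic. The helper names and the "hypothetical_db" dialect below are assumptions made up for illustration; only the two conditions themselves come from the discussion above.

```python
# Dialects known NOT to index foreign key columns automatically.
DIALECTS_NEEDING_EXPLICIT_INDEX = ("postgresql", "sqlite")


def create_index_allow_list(dialect_name: str) -> bool:
    """Condition used in the migration: only act on known dialects."""
    return dialect_name in DIALECTS_NEEDING_EXPLICIT_INDEX


def create_index_deny_list(dialect_name: str) -> bool:
    """Alternative condition: act on everything except MySQL."""
    return dialect_name != "mysql"


# "hypothetical_db" stands in for a future engine that, like MySQL,
# indexes foreign keys by default.
for name in ("postgresql", "sqlite", "mysql", "hypothetical_db"):
    print(
        f"{name}: allow-list={create_index_allow_list(name)}, "
        f"deny-list={create_index_deny_list(name)}"
    )
```

Both conditions agree on the three dialects discussed here. They differ only for an unknown dialect: the allow-list skips index creation there, while the deny-list would attempt it.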

uranusjr · 2024-05-14

Do we also need to declare the indexes to match? In SQLAlchemy 2.0 there’s ddl_if that can do the right thing, but I’m not sure if there’s an equivalent in 1.4.

pankajkoti · 2024-05-14

Are you suggesting this because of the failing test? I can't find anything like that in our previous migrations.

I'm guessing the mismatch arises because we're adding the Index to the SQLAlchemy models, which isn't necessary for MySQL since it creates those indexes by default. I tried removing it from the models and keeping it only in the migration. With that, the check succeeds for MySQL but then complains for PostgreSQL and SQLite about a mismatch between the DB schema and the SQLAlchemy models.

@ephraimbuddy would you have some input for me here, based on your previous migrations expertise? :)

uranusjr · 2024-05-15

I’m mostly suggesting it because having a matching declaration and migration seems like a reasonable thing to do.

ephraimbuddy · 2024-05-15

One way to get what's in the ORM to match what's in the migration file is to use autogenerate instead of writing the migration file yourself. Delete the migration file you created, then reset the database using the migration files:
airflow db reset --use-migration-files
Then cd into airflow and run autogenerate to produce the new migration file:
alembic revision -m "Update indexes" --autogenerate

pankajkoti · 2024-05-15

An update: I created a fresh PR, #39638, to apply Ephraim's suggestion. This branch also has conflicts because a newer migration was merged to main, so it is easier to resolve them in a new branch cut from main.

We're trying to work things out in that PR. I will close this one soon if we can make things work there.

pankajkoti · 2024-05-15T19:34:58Z

Closing in favour of #39638

boring-cyborg bot added the area:db-migrations (PRs with DB migration) and kind:documentation labels May 9, 2024

Add indexes on dag_id column in refencing tables to speed up deletion of dag records

e498ea5

pankajkoti force-pushed the idx-optimise-slow-deletion-of-dags branch from e505efe to e498ea5 May 9, 2024 15:21

pankajkoti requested review from Taragolis, ephraimbuddy and jedcunningham May 9, 2024 15:22

pankajkoti marked this pull request as ready for review May 9, 2024 15:22

pankajkoti requested review from potiuk, kaxil, XD-DENG and ashb as code owners May 9, 2024 15:22

pankajkoti requested a review from dstandish May 9, 2024 17:32

kaxil added this to the Airflow 2.9.2 milestone May 10, 2024

Add indexes for FKs only in case of PostgreSQL and SQLite

3ce4fa7


phanikumv changed the title from "Add indexes on dag_id column in refencing tables to speed up deletion of dag records" to "Add indexes on dag_id column in referencing tables to speed up deletion of dag records" May 15, 2024

pankajkoti marked this pull request as draft May 15, 2024 13:45

kaxil modified the milestones: Airflow 2.9.2, Airflow 2.10.0 May 15, 2024

pankajkoti closed this May 15, 2024


