[Refactor][Bitbucket_Server] Speed up PR collector/extractor #7457

sstojak1 · 2024-05-11T06:08:03Z

What and why to refactor

What are you trying to refactor? Why should it be refactored now?
The pr_collector for Bitbucket Server consistently adds the same data to the RAW_PULL_REQUEST_TABLE after each run.
Consequently, the extractApiPullRequests process slows down because it has to sift through all the records in the raw table, including duplicates.
For instance, if a repository has 1000 pull requests, after 10 job runs, the raw table will contain 10,000 rows, and extractApiPullRequests will have to process each of these records.
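The growth described above can be sketched as a small model. This is only an illustration of the arithmetic: `rawRowsAfterRuns` is a hypothetical helper, not DevLake code, and it assumes every run re-inserts every pull request.

```go
package main

import "fmt"

// rawRowsAfterRuns models an append-only raw table (an assumption for this
// sketch): every run re-inserts all pull requests, so the number of rows the
// extractor must scan grows linearly with the number of runs.
func rawRowsAfterRuns(pullRequests, runs int) int {
	rows := 0
	for i := 0; i < runs; i++ {
		rows += pullRequests // nothing is purged between runs
	}
	return rows
}

func main() {
	for _, runs := range []int{1, 5, 10} {
		fmt.Printf("runs=%d raw rows=%d\n", runs, rawRowsAfterRuns(1000, runs))
	}
}
```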

Is there a requirement to keep the full history of raw API imports, which would make deleting them infeasible?

Describe the solution you'd like

How to refactor?

Perhaps we could delete all records from the raw table and import the PRs again on each job run. This would prevent duplicates and avoid slowing down the extractApiPullRequests task.
Alternatively, check how the Bitbucket plugin handles this and reuse its logic if it is better.

Related issues

Please link any other

Additional context

Add any other context or screenshots about the feature request here.

How to recreate:
Run Collect Data for Bitbucket Server more than once and observe the size of _raw_bitbucket_server_api_pull_requests table.

Startrekzky · 2024-05-15T02:51:20Z

Hi @sstojak1, it's a valid refactor, but may I know why the same data keeps being added to the RAW_PULL_REQUEST_TABLE after each run?

sstojak1 · 2024-05-16T18:48:49Z

Hi @Startrekzky, I would say this is because the BB Server API doesn't support a date query parameter, so you cannot fetch only the PRs that were created or updated after the last import job.

Because of that, DevLake imports all PRs during each job run (BB Cloud does not need to, since its API supports a date query parameter).

klesh · 2024-05-17T03:07:15Z

Based on the information you provided, it seems the collector would benefit from using the simpler ApiCollector since it would purge related records from the raw data table before saving new PR information. This might be more efficient compared to the StatefulApiCollector in this context.
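The purge-before-save idea can be sketched with an in-memory table. This is not the actual ApiCollector or StatefulApiCollector implementation; `rawTable`, `collectAppendOnly`, and `collectWithPurge` are hypothetical names, and the sketch assumes rows are scoped by a `Params` key such as a repository identifier.

```go
package main

import "fmt"

// rawRow is a hypothetical stand-in for one record in a raw table.
type rawRow struct {
	Params string // scope of the record, e.g. one repository (assumed)
	Data   string // raw API payload
}

type rawTable struct{ rows []rawRow }

// appendAll stores every payload under the given scope.
func (t *rawTable) appendAll(params string, payloads []string) {
	for _, p := range payloads {
		t.rows = append(t.rows, rawRow{Params: params, Data: p})
	}
}

// purge removes all rows that belong to the given scope.
func (t *rawTable) purge(params string) {
	kept := t.rows[:0]
	for _, r := range t.rows {
		if r.Params != params {
			kept = append(kept, r)
		}
	}
	t.rows = kept
}

// collectAppendOnly models the behaviour reported in this issue.
func collectAppendOnly(t *rawTable, params string, payloads []string) {
	t.appendAll(params, payloads)
}

// collectWithPurge deletes the scope's old rows before saving new ones.
func collectWithPurge(t *rawTable, params string, payloads []string) {
	t.purge(params)
	t.appendAll(params, payloads)
}

// fakePullRequests builds n placeholder payloads.
func fakePullRequests(n int) []string {
	out := make([]string, n)
	for i := range out {
		out[i] = fmt.Sprintf(`{"id":%d}`, i+1)
	}
	return out
}

func main() {
	prs := fakePullRequests(1000)
	appendOnly, purged := &rawTable{}, &rawTable{}
	for run := 0; run < 10; run++ {
		collectAppendOnly(appendOnly, "repo-1", prs)
		collectWithPurge(purged, "repo-1", prs)
	}
	fmt.Println("append-only rows:", len(appendOnly.rows))
	fmt.Println("purge-first rows:", len(purged.rows))
}
```

With purging, the extractor only ever sees one copy of each pull request, at the cost of losing the history of earlier raw imports.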

sstojak1 · 2024-05-17T06:45:46Z

@klesh That might work! Is there a way to run DevLake locally in debug mode? I'd like to go through the ApiCollector implementation to understand its impact on the rest of the steps for importing BB Server data...

klesh · 2024-05-17T09:14:35Z

Yes, sure. You may follow this guide.

In case you want to execute specific subtasks, you may go to the backend folder and run something like the following:

go run plugins/jira/jira.go -c 2 -b 8 -t "extractWorklogs"

Remember to change the plugin name and arguments accordingly.

sstojak1 added the type/refactor label ("This issue is to refactor existing code") on May 11, 2024

sstojak1 changed the title from "[Refactor][Bitbucket_Server] Refactor title" to "[Refactor][Bitbucket_Server] Speed up PR collector/extractor" on May 11, 2024
