[Refactor][Bitbucket_Server] Speed up PR collector/extractor #7457

Open
sstojak1 opened this issue May 11, 2024 · 5 comments
Labels
type/refactor This issue is to refactor existing code

Comments

@sstojak1
Contributor

sstojak1 commented May 11, 2024

What and why to refactor

What are you trying to refactor? Why should it be refactored now?
The pr_collector for Bitbucket Server consistently adds the same data to the RAW_PULL_REQUEST_TABLE after each run.
Consequently, the extractApiPullRequests process slows down because it has to sift through all the records in the raw table, including duplicates.
For instance, if a repository has 1000 pull requests, after 10 job runs, the raw table will contain 10,000 rows, and extractApiPullRequests will have to process each of these records.
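
The arithmetic above can be checked with a minimal sketch. It assumes an append-only collector that re-fetches every PR on each run; the `rawRow` type and both function names are illustrative only and are not part of DevLake.

```go
package main

import "fmt"

// rawRow stands in for one record of the raw pull request table.
// The fields are illustrative, not DevLake's actual schema.
type rawRow struct {
	PullRequestID int
	Run           int
}

// collectAppendOnly simulates a collector that re-fetches every PR on each
// run and appends the results without removing rows from earlier runs.
func collectAppendOnly(prCount, runs int) []rawRow {
	table := make([]rawRow, 0, prCount*runs)
	for run := 1; run <= runs; run++ {
		for id := 1; id <= prCount; id++ {
			table = append(table, rawRow{PullRequestID: id, Run: run})
		}
	}
	return table
}

// distinctPullRequests counts how many unique PRs the table actually holds.
func distinctPullRequests(table []rawRow) int {
	seen := make(map[int]struct{})
	for _, row := range table {
		seen[row.PullRequestID] = struct{}{}
	}
	return len(seen)
}

func main() {
	table := collectAppendOnly(1000, 10)
	fmt.Println("rows the extractor must scan:", len(table)) // 10000
	fmt.Println("distinct PRs:", distinctPullRequests(table)) // 1000
}
```

The extractor's work grows with the number of runs, while the amount of useful data stays constant.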

Is there a reason to keep the full history of raw API imports, which would make deleting old rows infeasible?

Describe the solution you'd like

How to refactor?

  1. Perhaps we could delete all records from the raw table and re-import the PRs on each job run. This would prevent duplicates and keep the extractApiPullRequests task from slowing down.
  2. Check how the Bitbucket plugin handles this and reuse its logic if it is better.
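
Option 1 can be sketched as a purge-then-insert step. This is an in-memory illustration under stated assumptions, not DevLake code: the `rawTable` type, the `Scope` key and the method names are all hypothetical, and a real implementation would issue the delete against the database, scoped to the repository being collected.

```go
package main

import "fmt"

// rawRecord is a stand-in for one raw API response row.
type rawRecord struct {
	Scope string // hypothetical key: the repository the row was collected for
	Data  string // the raw payload
}

// rawTable is an in-memory stand-in for the raw pull request table.
type rawTable struct {
	rows []rawRecord
}

// purge removes every row that belongs to scope and keeps all other rows.
func (t *rawTable) purge(scope string) {
	kept := t.rows[:0]
	for _, r := range t.rows {
		if r.Scope != scope {
			kept = append(kept, r)
		}
	}
	t.rows = kept
}

// collect replaces the rows of scope with the freshly fetched payloads,
// so repeated runs never accumulate duplicates.
func (t *rawTable) collect(scope string, payloads []string) {
	t.purge(scope)
	for _, p := range payloads {
		t.rows = append(t.rows, rawRecord{Scope: scope, Data: p})
	}
}

func main() {
	table := &rawTable{}
	payloads := []string{"pr-1", "pr-2", "pr-3"}
	for run := 0; run < 10; run++ {
		table.collect("repo-a", payloads)
	}
	fmt.Println(len(table.rows)) // 3, regardless of the number of runs
}
```

The trade-off is that the raw table no longer keeps the history of earlier imports, which is the open question raised above.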

Related issues

Please link any other

Additional context

Add any other context or screenshots about the feature request here.

How to recreate:
Run Collect Data for Bitbucket Server more than once and observe the size of _raw_bitbucket_server_api_pull_requests table.

@sstojak1 sstojak1 added the type/refactor This issue is to refactor existing code label May 11, 2024
@sstojak1 sstojak1 changed the title [Refactor][Bitbucket_Server] Refactor title [Refactor][Bitbucket_Server] Speed up PR collector/extractor May 11, 2024
@Startrekzky
Contributor

Hi @sstojak1, it's a valid refactor, but may I know why the same data keeps being added to the RAW_PULL_REQUEST_TABLE after each run?

@sstojak1
Contributor Author

Hi @Startrekzky, I would say this is because the BB Server API doesn't support a date query parameter, so you cannot fetch only the PRs that were created or updated after the last import job.

Because of that, DevLake imports all PRs during each job run (BB Cloud does not have this problem, since its API supports a date query parameter).

@klesh
Contributor

klesh commented May 17, 2024

Based on the information you provided, it seems the collector would benefit from using the simpler ApiCollector since it would purge related records from the raw data table before saving new PR information. This might be more efficient compared to the StatefulApiCollector in this context.

@sstojak1
Contributor Author

@klesh That might work! Is there a way to run DevLake locally in debug mode? I'd like to go through the ApiCollector implementation to understand its impact on the rest of the steps for importing BB Server data...

@klesh
Contributor

klesh commented May 17, 2024

Yes, sure. You may follow this guide.

In case you want to execute specific subtasks, you may go to the backend folder and run something like the following:

go run plugins/jira/jira.go -c 2 -b 8 -t "extractWorklogs"

Remember to change the plugin name and arguments accordingly.
