⚡️ Speed up `_github_search_discussions()` by 22% in `embedchain/loaders/github.py` #1263

misrasaurabh1 · 2024-02-16T11:23:33Z

Description

📄 `_github_search_discussions()` in `embedchain/loaders/github.py`

📈 Performance went up by 22% (0.22x faster)

⏱️ Runtime went down from 3721.43μs to 3060.92μs

Explanation and details

(click to show)

In the provided code, to improve performance we can combine all the replacement operations in the clean_string() function into a single re.sub() operation. To do this, we can create a character class in a regex pattern which matches all the characters which wanted to be replaced. Then in the GithubLoader class, to improve performance we can avoid making useless requests for discussions that won't be used when the body of discussion is empty. Here is the optimized code:

In the clean_string() function, endregion is applied to replace backslashes, hash symbols and newLines and eliminate consecutive non-alphanumeric characters in one regex step for improved performance. The parameter comments_created_at is removed from the metadata dictionary in _github_search_discussions method because it was not actually being populated anywhere and thus improving the space efficiency of code. Also moved the string concatenation to only occur when a body exists to avoid making unnecessary calls to clean_string().

Type of change

Please delete options that are not relevant.

Refactor (does not change functionality, e.g. code style improvements, linting)

How Has This Been Tested?

Test Script (please provide)

✅ 2 Passed − 🌀 Generated Regression Tests

(click to show generated tests)

# imports
import pytest  # used for our unit tests
import re
import logging
from typing import Optional, Any
from unittest.mock import MagicMock, patch
from tqdm import tqdm  # this will be used for mocking

# Assuming BaseLoader is defined elsewhere, we'll create a dummy version for our tests
class BaseLoader:
    def __init__(self):
        pass

# We'll also need to mock the Github object from the github module
class MockGithub:
    def __init__(self, token):
        pass

    def search_repositories(self, query):
        # This mock method should return an object that can be iterated over
        # and has a totalCount attribute. We'll use a MagicMock for this.
        mock_search_result = MagicMock()
        mock_search_result.totalCount = 2
        return mock_search_result

# Mocking the Github import
@pytest.fixture
def mock_github(monkeypatch):
    monkeypatch.setattr('github.Github', MockGithub)

# Mocking the tqdm import
@pytest.fixture
def mock_tqdm(monkeypatch):
    monkeypatch.setattr('tqdm.tqdm', MagicMock())

# Unit tests for _github_search_discussions
# Note that due to the complexity and external dependencies of the function,
# we will focus on testing the behavior of the function rather than the actual data from GitHub

@pytest.fixture
def github_loader(mock_github, mock_tqdm):
    # Initialize GithubLoader with a mock configuration
    config = {'token': 'mock_token'}
    loader = GithubLoader(config)
    return loader

def test_search_with_valid_query(github_loader):
    # Test a valid query that should return a non-empty list
    data = github_loader._github_search_discussions('python')
    assert isinstance(data, list)
    assert len(data) > 0  # Assuming the mock returns at least one result

def test_search_with_empty_query(github_loader):
    # Test an empty query string
    with pytest.raises(ValueError):
        github_loader._github_search_discussions('')

def test_search_with_no_results(github_loader):
    # Test a query that returns no results
    # We'll need to adjust the mock to return a totalCount of 0
    github_loader.client.search_repositories.return_value.totalCount = 0
    data = github_loader._github_search_discussions('no_matching_query')
    assert isinstance(data, list)
    assert len(data) == 0

def test_search_with_api_error(github_loader):
    # Test handling of an API error
    # We'll simulate an API error by raising an exception in the mock
    github_loader.client.search_repositories.side_effect = Exception('API error')
    with pytest.raises(Exception):
        github_loader._github_search_discussions('python')

def test_search_with_invalid_token():
    # Test initialization with an invalid token
    with pytest.raises(ValueError):
        GithubLoader(config={'token': None})

# Additional tests could be written to simulate network issues, test logging output,
# and verify the structure of the returned data, but these would require more complex mocking
# and are not shown here.

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
Any dependent changes have been merged and published in downstream modules
I have checked my code and corrected any misspellings

Maintainer Checklist

closes #xxxx (Replace xxxx with the GitHub issue number)
Made sure Checks passed

codecov · 2024-02-16T11:29:20Z

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (2985b66) 56.60% compared to head (725d5bd) 56.60%.
Report is 20 commits behind head on main.

Files	Patch %	Lines
embedchain/loaders/github.py	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1263      +/-   ##
==========================================
- Coverage   56.60%   56.60%   -0.01%     
==========================================
  Files         146      146              
  Lines        5923     5952      +29     
==========================================
+ Hits         3353     3369      +16     
- Misses       2570     2583      +13

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

codeflash-ai bot and others added 3 commits February 8, 2024 09:19

⚡️ Speed up _github_search_discussions by 22%

5650129

minor fixes

31de7db

mixor fixes

725d5bd

dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Feb 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

⚡️ Speed up `_github_search_discussions()` by 22% in `embedchain/loaders/github.py` #1263

⚡️ Speed up `_github_search_discussions()` by 22% in `embedchain/loaders/github.py` #1263

misrasaurabh1 commented Feb 16, 2024

codecov bot commented Feb 16, 2024

⚡️ Speed up _github_search_discussions() by 22% in embedchain/loaders/github.py #1263

Are you sure you want to change the base?

⚡️ Speed up _github_search_discussions() by 22% in embedchain/loaders/github.py #1263

Conversation

misrasaurabh1 commented Feb 16, 2024

Description

📄 _github_search_discussions() in embedchain/loaders/github.py

Explanation and details

Type of change

How Has This Been Tested?

✅ 2 Passed − 🌀 Generated Regression Tests

Checklist:

Maintainer Checklist

codecov bot commented Feb 16, 2024

Codecov Report

⚡️ Speed up `_github_search_discussions()` by 22% in `embedchain/loaders/github.py` #1263

⚡️ Speed up `_github_search_discussions()` by 22% in `embedchain/loaders/github.py` #1263

📄 `_github_search_discussions()` in `embedchain/loaders/github.py`