Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

⚡️ Speed up _github_search_discussions() by 22% in embedchain/loaders/github.py #1263

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

misrasaurabh1
Copy link
Contributor

Description

📄 _github_search_discussions() in embedchain/loaders/github.py

📈 Performance went up by 22% (0.22x faster)

⏱️ Runtime went down from 3721.43μs to 3060.92μs

Explanation and details

(click to show)

In the provided code, to improve performance we can combine all the replacement operations in the clean_string() function into a single re.sub() operation. To do this, we can create a character class in a regex pattern which matches all the characters which wanted to be replaced. Then in the GithubLoader class, to improve performance we can avoid making useless requests for discussions that won't be used when the body of discussion is empty. Here is the optimized code:

In the clean_string() function, endregion is applied to replace backslashes, hash symbols and newLines and eliminate consecutive non-alphanumeric characters in one regex step for improved performance. The parameter comments_created_at is removed from the metadata dictionary in _github_search_discussions method because it was not actually being populated anywhere and thus improving the space efficiency of code. Also moved the string concatenation to only occur when a body exists to avoid making unnecessary calls to clean_string().

Type of change

Please delete options that are not relevant.

  • Refactor (does not change functionality, e.g. code style improvements, linting)

How Has This Been Tested?

  • Test Script (please provide)

✅ 2 Passed − 🌀 Generated Regression Tests

(click to show generated tests)
# imports
import pytest  # used for our unit tests
import re
import logging
from typing import Optional, Any
from unittest.mock import MagicMock, patch
from tqdm import tqdm  # this will be used for mocking

# Assuming BaseLoader is defined elsewhere, we'll create a dummy version for our tests
class BaseLoader:
    def __init__(self):
        pass

# We'll also need to mock the Github object from the github module
class MockGithub:
    def __init__(self, token):
        pass

    def search_repositories(self, query):
        # This mock method should return an object that can be iterated over
        # and has a totalCount attribute. We'll use a MagicMock for this.
        mock_search_result = MagicMock()
        mock_search_result.totalCount = 2
        return mock_search_result

# Mocking the Github import
@pytest.fixture
def mock_github(monkeypatch):
    monkeypatch.setattr('github.Github', MockGithub)

# Mocking the tqdm import
@pytest.fixture
def mock_tqdm(monkeypatch):
    monkeypatch.setattr('tqdm.tqdm', MagicMock())

# Unit tests for _github_search_discussions
# Note that due to the complexity and external dependencies of the function,
# we will focus on testing the behavior of the function rather than the actual data from GitHub

@pytest.fixture
def github_loader(mock_github, mock_tqdm):
    # Initialize GithubLoader with a mock configuration
    config = {'token': 'mock_token'}
    loader = GithubLoader(config)
    return loader

def test_search_with_valid_query(github_loader):
    # Test a valid query that should return a non-empty list
    data = github_loader._github_search_discussions('python')
    assert isinstance(data, list)
    assert len(data) > 0  # Assuming the mock returns at least one result

def test_search_with_empty_query(github_loader):
    # Test an empty query string
    with pytest.raises(ValueError):
        github_loader._github_search_discussions('')

def test_search_with_no_results(github_loader):
    # Test a query that returns no results
    # We'll need to adjust the mock to return a totalCount of 0
    github_loader.client.search_repositories.return_value.totalCount = 0
    data = github_loader._github_search_discussions('no_matching_query')
    assert isinstance(data, list)
    assert len(data) == 0

def test_search_with_api_error(github_loader):
    # Test handling of an API error
    # We'll simulate an API error by raising an exception in the mock
    github_loader.client.search_repositories.side_effect = Exception('API error')
    with pytest.raises(Exception):
        github_loader._github_search_discussions('python')

def test_search_with_invalid_token():
    # Test initialization with an invalid token
    with pytest.raises(ValueError):
        GithubLoader(config={'token': None})

# Additional tests could be written to simulate network issues, test logging output,
# and verify the structure of the returned data, but these would require more complex mocking
# and are not shown here.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

Maintainer Checklist

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Made sure Checks passed

@dosubot dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Feb 16, 2024
Copy link

codecov bot commented Feb 16, 2024

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (2985b66) 56.60% compared to head (725d5bd) 56.60%.
Report is 20 commits behind head on main.

Files Patch % Lines
embedchain/loaders/github.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1263      +/-   ##
==========================================
- Coverage   56.60%   56.60%   -0.01%     
==========================================
  Files         146      146              
  Lines        5923     5952      +29     
==========================================
+ Hits         3353     3369      +16     
- Misses       2570     2583      +13     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size:S This PR changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant