
⚡️ Speed up get_word_count() by 6% in embedchain/chunkers/base_chunker.py #1268

Status: Open. Wants to merge 1 commit into base: main.
Conversation

misrasaurabh1
Contributor

Description

📄 get_word_count() in embedchain/chunkers/base_chunker.py

📈 Performance went up by 6% (1.06× faster)

⏱️ Runtime went down from 1014.51μs to 955.61μs

Explanation and details


Changes:

  • Changed sum([len(document.split(" ")) for document in documents]) to sum(len(document.split(" ")) for document in documents). This replaces the list comprehension with a generator expression; sum accepts any iterable, so there is no need to build an intermediate list at all.

  • This makes the code more memory efficient and slightly faster, because list allocation is unnecessary work here: we iterate over the counts exactly once and never need them stored together. Python no longer has to allocate a whole list of per-document word counts and keep it alive while summing. Instead, each count is produced, added to the running total, and discarded before the next one is computed.
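As a minimal standalone sketch of the change (the real method lives on embedchain's BaseChunker; these free functions only mirror the stated one-line body), the before/after looks like:

```python
# Before: the list comprehension materializes every per-document count
# in memory before sum() sees any of them.
def get_word_count_before(documents):
    return sum([len(document.split(" ")) for document in documents])

# After: the generator expression hands sum() one count at a time,
# so no intermediate list is ever allocated.
def get_word_count_after(documents):
    return sum(len(document.split(" ")) for document in documents)

docs = ["Hello world", "Python programming", "Unit testing"]
assert get_word_count_before(docs) == get_word_count_after(docs) == 6
```

Both versions are behaviorally identical for any iterable of strings; the difference is purely allocation, which is why the speedup is small but consistent.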

Type of change

Please delete options that are not relevant.

  • Refactor (does not change functionality, e.g. code style improvements, linting)

How Has This Been Tested?

The new optimized code was tested for correctness. The results are listed below.

  • Test Script (please provide)

✅ 26 Passed − 🌀 Generated Regression Tests

# imports
import pytest  # used for our unit tests
from embedchain.chunkers.base_chunker import BaseChunker
# unit tests

# Test normal cases with known number of words
def test_normal_cases():
    assert BaseChunker.get_word_count(["The quick brown fox jumps over the lazy dog"]) == 9
    assert BaseChunker.get_word_count(["Hello world", "Python programming", "Unit testing"]) == 6

# Test with empty documents
def test_empty_documents():
    assert BaseChunker.get_word_count([""]) == 1  # "".split(" ") yields one empty token
    assert BaseChunker.get_word_count(["", "", ""]) == 3
    assert BaseChunker.get_word_count(["", "Hello", ""]) == 3

# Test handling of whitespace
def test_whitespace_handling():
    assert BaseChunker.get_word_count(["  Leading", "Trailing  ", "Multiple   Spaces"]) == 10  # split(" ") keeps empty tokens around extra spaces
    assert BaseChunker.get_word_count(["Line\nbreak", "Tab\tseparated", "Carriage\rreturn"]) == 3

# Test with punctuation and special characters
def test_special_characters_and_punctuation():
    assert BaseChunker.get_word_count(["Hello, world!", "Python programming: An introduction."]) == 6
    assert BaseChunker.get_word_count(["Symbols #$%^&*", "@decorators"]) == 3

# Test with non-string inputs
def test_non_string_inputs():
    with pytest.raises(AttributeError):  # 42 has no .split() method
        BaseChunker.get_word_count([42, "Hello", True])

# Test with an empty list
def test_empty_list():
    assert BaseChunker.get_word_count([]) == 0

# Test with large documents and large number of documents
def test_large_documents_and_large_number_of_documents():
    assert BaseChunker.get_word_count(["word " * 1000]) == 1001  # trailing space adds one empty token
    assert BaseChunker.get_word_count(["doc"] * 1000) == 1000

# Test with Unicode and internationalization
def test_unicode_and_internationalization():
    assert BaseChunker.get_word_count(["Привет мир", "こんにちは世界"]) == 3  # no spaces inside the Japanese string

# Test immutability and side effects
def test_immutability_and_side_effects():
    documents = ["Hello", "world"]
    original_documents = list(documents)  # Make a copy of the original documents
    BaseChunker.get_word_count(documents)
    assert documents == original_documents  # Ensure original documents are unchanged

# Test documents with only spaces
def test_documents_with_only_spaces():
    assert BaseChunker.get_word_count(["     "]) == 6  # five spaces split into six empty tokens
    assert BaseChunker.get_word_count(["   ", "  ", " "]) == 9

# Test documents with delimiters other than spaces
def test_documents_with_other_delimiters():
    assert BaseChunker.get_word_count(["word1,word2,word3"]) == 1

# Test documents with numbers and mixed content
def test_mixed_content():
    assert BaseChunker.get_word_count(["The price is $9.99 for 2 items."]) == 7

# Test documents with non-standard whitespace characters
def test_non_standard_whitespace_characters():
    assert BaseChunker.get_word_count(["word\u200bword"]) == 1

# Test input as a generator
def test_input_as_generator():
    assert BaseChunker.get_word_count(doc for doc in ["Hello", "world"]) == 2

# Test unusual Unicode characters
def test_unusual_unicode_characters():
    assert BaseChunker.get_word_count(["Hello🌎World"]) == 1

# Test documents as other iterable types
def test_documents_as_other_iterable_types():
    assert BaseChunker.get_word_count(("Hello", "world")) == 2
    assert BaseChunker.get_word_count({"Hello", "world"}) == 2

# Test extremely long words
def test_extremely_long_words():
    assert BaseChunker.get_word_count(["a" * 1000000]) == 1

# Test handling of hyphenated words
def test_hyphenated_words():
    assert BaseChunker.get_word_count(["well-known"]) == 1

# Test case sensitivity
def test_case_sensitivity():
    assert BaseChunker.get_word_count(["Word WORD WoRd"]) == 3
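A note for reading the edge-case assertions above: because the implementation splits on a literal space, str.split(" ") behaves quite differently from the whitespace-collapsing str.split(). Empty strings and runs of spaces still produce (empty) tokens, and newlines or tabs do not delimit at all:

```python
# str.split(" ") vs str.split(): the optimized code uses split(" "),
# so empty strings and repeated spaces still count as tokens.
assert "".split(" ") == [""]                        # one empty token
assert "".split() == []                             # whitespace split drops it
assert "a  b".split(" ") == ["a", "", "b"]          # empty token between spaces
assert "a  b".split() == ["a", "b"]
assert "Line\nbreak".split(" ") == ["Line\nbreak"]  # newline is not a delimiter
```

This is standard CPython behavior, and it is what drives the non-obvious counts in the empty-string and whitespace tests.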

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

Maintainer Checklist

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Made sure Checks passed

@dosubot dosubot bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label Feb 16, 2024

codecov bot commented Feb 16, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (2985b66) 56.60% compared to head (3cbf64a) 56.59%.
Report is 20 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1268      +/-   ##
==========================================
- Coverage   56.60%   56.59%   -0.02%     
==========================================
  Files         146      146              
  Lines        5923     5955      +32     
==========================================
+ Hits         3353     3370      +17     
- Misses       2570     2585      +15     

