⚡️ Speed up get_word_count() by 6% in embedchain/chunkers/base_chunker.py #1268
Description
📄 get_word_count() in embedchain/chunkers/base_chunker.py
📈 Performance went up by 6% (0.06x faster)
⏱️ Runtime went down from 1014.51μs to 955.61μs
Explanation and details
Changes:
Changed
sum([len(document.split(" ")) for document in documents])
to
sum(len(document.split(" ")) for document in documents)
This avoids building an intermediate list: sum() accepts any iterable, so it can consume a generator expression directly. Python no longer has to allocate a list of per-document word counts and keep it alive while adding them up. Instead, it produces one word count, adds it to the running total, and discards it before moving on to the next. That makes the code more memory-efficient and probably a little faster, because the allocation is unnecessary when the counts are iterated only once.
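A minimal sketch of the two forms side by side, to make the equivalence concrete. The function names and sample documents here are hypothetical and used only for illustration; they are not the actual code in base_chunker.py.

```python
def word_count_with_list(documents):
    # Original form: the list comprehension materializes every
    # per-document count before sum() starts adding them up.
    return sum([len(document.split(" ")) for document in documents])


def word_count_with_generator(documents):
    # Optimized form: sum() pulls counts from the generator one at a
    # time, so no intermediate list is allocated.
    return sum(len(document.split(" ")) for document in documents)


# Hypothetical sample input, for illustration only.
documents = ["hello world", "one two three", "single"]

# Both forms return the same total; only the memory behavior differs.
assert word_count_with_list(documents) == word_count_with_generator(documents)
```

Any measured speedup will likely depend on the number and size of the documents, so it is worth benchmarking against representative input rather than assuming the 6% figure carries over.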
Type of change
Please delete options that are not relevant.
How Has This Been Tested?
The new optimized code was tested for correctness. The results are listed below.
✅ 26 Passed − 🌀 Generated Regression Tests
Checklist:
Maintainer Checklist