@muazhari Currently the text-splitting separators only take effect when overlap is not used. The "base" chunk is split on a word boundary; it is the overlap prefix that produces the mid-word starts of the chunks. You can see that the end of each chunk is on a word boundary. If you turn off overlap (omit the overlap argument or set it to 0), you'll see that the element text also starts on a word boundary.
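To make the behavior described above concrete, here is a minimal pure-Python sketch. It is an illustrative model, not unstructured's actual implementation: the function name split_with_char_overlap is hypothetical, the only separator considered is a space, and joining the overlap tail to the next chunk with a space is an assumption. It only shows why a character-count overlap prefix can start mid-word while each base chunk still ends on a word boundary.

```python
def split_with_char_overlap(text: str, max_characters: int, overlap: int) -> list[str]:
    """Hypothetical model: word-boundary split, then a raw character-tail overlap."""
    base_chunks = []
    remaining = text.strip()
    while len(remaining) > max_characters:
        # Cut at the last space inside the window so the base chunk ends on a word boundary.
        cut = remaining.rfind(" ", 0, max_characters + 1)
        if cut <= 0:
            cut = max_characters  # no space in the window: fall back to a hard cut
        base_chunks.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        base_chunks.append(remaining)

    # The overlap prefix is the last `overlap` characters of the previous base chunk,
    # taken by character count only, so it may begin in the middle of a word.
    chunks = base_chunks[:1]
    for prev, cur in zip(base_chunks, base_chunks[1:]):
        tail = prev[-overlap:] if overlap > 0 else ""
        chunks.append(f"{tail} {cur}" if tail else cur)
    return chunks


text = "the quick brown fox jumps over the lazy dog"
with_overlap = split_with_char_overlap(text, max_characters=16, overlap=3)
without_overlap = split_with_char_overlap(text, max_characters=16, overlap=0)
print(with_overlap)     # second and third chunks start mid-word ("own ...", "ver ...")
print(without_overlap)  # every chunk starts and ends on a word boundary
```

In this simplified model the prefix is added on top of the base chunk, so an overlapped chunk can exceed max_characters; the real library may budget this differently.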
Is it possible for overlap to respect text_splitting_separators? If possible, when will this feature be implemented?
Example:
C = Character
S = Separator
Initial texts:
Text A = C1C2S1C3S2C4
Text B = C5C6S1C7S2C8
Overlapped texts with respect to one of the separators:
Text A = C1C2S1C3S2C4
Text B = C4C5C6S1C7S2C8
Formal:
[A0, An] [B0, Bn]
[A0, An] [Aoverlap_index, A to B separator, B0, Bn]
overlap_index = last separator index of A + 1
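The proposal above could be sketched roughly as follows. This is only an illustration of the requested behavior, not an existing unstructured feature: the names overlap_from_last_separator and add_separator_overlap, the separators tuple, and the joiner parameter (standing in for the "A to B separator") are all hypothetical. The overlap is the tail of the previous chunk that follows its last separator, i.e. overlap_index = last separator index of A + 1 for single-character separators.

```python
def overlap_from_last_separator(prev: str, separators: tuple[str, ...]) -> str:
    """Return the part of `prev` after its last separator, or "" if none occurs."""
    overlap_index = -1
    for sep in separators:
        idx = prev.rfind(sep)
        if idx != -1:
            # Start just past the separator that ends latest in the chunk.
            overlap_index = max(overlap_index, idx + len(sep))
    return prev[overlap_index:] if overlap_index != -1 else ""


def add_separator_overlap(
    chunks: list[str], separators: tuple[str, ...], joiner: str = ""
) -> list[str]:
    """Prefix every chunk after the first with a separator-aligned tail of its predecessor."""
    out = list(chunks[:1])
    for prev, cur in zip(chunks, chunks[1:]):
        tail = overlap_from_last_separator(prev, separators)
        out.append(tail + joiner + cur if tail else cur)
    return out


# Mirrors the example: C1..C8 are the letters a..h, S1 is "|", S2 is ",".
text_a = "ab|c,d"  # C1 C2 S1 C3 S2 C4
text_b = "ef|g,h"  # C5 C6 S1 C7 S2 C8
overlapped = add_separator_overlap([text_a, text_b], separators=("|", ","))
print(overlapped)
```

With these inputs the second chunk becomes C4 followed by the original Text B, so the overlap always starts right after a separator instead of in the middle of a token.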
Describe the bug
When using chunk_elements(), the default parameter value of text_splitting_separators does not have any effect.
To Reproduce
Expected behavior
chunk_elements() should respect the default parameter value of text_splitting_separators.
Screenshots
Environment Info
OS version: Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python version: 3.10.12
unstructured version: 0.13.2
unstructured-inference version: 0.7.25
pytesseract version: 0.3.10
Torch version: 2.2.0
Detectron2 is not installed
PaddleOCR version: 2.6.1.2
Libmagic version: 1:5.41-3ubuntu0.1
LibreOffice version: LibreOffice 7.3.7.2 30(Build:2)
Additional context
Add a text_splitting_separators parameter to chunk_elements() so that it can be customized.