Bug: incorrect value of start_index in RecursiveCharacterTextSplitter when substring is present #21475

Open
5 tasks done
maggonravi opened this issue May 9, 2024 · 1 comment
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature Ɑ: text splitters Related to text splitters package

Comments

@maggonravi
Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
splitter = RecursiveCharacterTextSplitter(chunk_size=5, chunk_overlap=5, separators=[" ", ""], add_start_index=True)
splitter.split_documents([Document(page_content="chunk chunk")])

Error Message and Stack Trace (if applicable)

No response

Description

Expected output

[Document(page_content='chunk', metadata={'start_index': 0}),
 Document(page_content='chun', metadata={'start_index': 6}),
 Document(page_content='chunk', metadata={'start_index': 6})]

Output with current code

[Document(page_content='chunk', metadata={'start_index': 0}),
 Document(page_content='chun', metadata={'start_index': 0}),
 Document(page_content='chunk', metadata={'start_index': 0})]
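
One plausible reading of the all-zero indices, sketched without LangChain. This assumes the installed version starts each search at `index + previous_chunk_len - chunk_overlap`; that rule is an assumption about the library, not something shown in this report. `start_indices_current` is a hypothetical helper, and the chunk list is copied from the output above.

```python
from typing import List


def start_indices_current(text: str, chunks: List[str], chunk_overlap: int) -> List[int]:
    """Model of the ASSUMED current start_index logic (not the library source)."""
    indices = []
    index = 0
    previous_chunk_len = 0
    for chunk in chunks:
        # Assumed rule: start searching where an overlap with the
        # previous chunk could begin, clamped to the start of the text.
        offset = index + previous_chunk_len - chunk_overlap
        index = text.find(chunk, max(0, offset))
        indices.append(index)
        previous_chunk_len = len(chunk)
    return indices


# Chunks as reported for chunk_size=5, chunk_overlap=5.
observed = start_indices_current(
    "chunk chunk", ["chunk", "chun", "chunk"], chunk_overlap=5
)
print(observed)  # [0, 0, 0]
```

Under that assumption, an overlap as large as the chunk size keeps the search offset at 0 for every chunk, so `str.find` keeps returning the first occurrence.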

System Info

System Information
------------------
> OS:  Linux
> OS Version:  #1 SMP Thu Feb 1 03:51:05 EST 2024
> Python Version:  3.11.8 (main, Mar 15 2024, 12:37:54) [GCC 10.3.1 20210422 (Red Hat 10.3.1-1)]

Package Information
-------------------
> langchain_core: 0.1.46
> langchain: 0.1.12
> langchain_community: 0.0.28
> langsmith: 0.0.82
> langchain_experimental: 0.0.47
> langchain_text_splitters: 0.0.1
> langchainplus_sdk: 0.0.21

@maggonravi (Author)

This change to TextSplitter.create_documents produces the expected output for the example above:

class TextSplitter(BaseDocumentTransformer, ABC):
    def create_documents(
        self, texts: List[str], metadatas: Optional[List[dict]] = None
    ) -> List[Document]:
        """Create documents from a list of texts."""
        _metadatas = metadatas or [{}] * len(texts)
        documents = []
        for i, text in enumerate(texts):
            # Start at -1 so the first search (index + 1) begins at position 0.
            index = -1
            previous_chunk_len = 0
            for j, chunk in enumerate(self.split_text(text)):
                metadata = copy.deepcopy(_metadatas[i])
                if self._add_start_index:
                    if j > 0:
                        # The overlap with the previous chunk can be no longer than
                        # chunk_overlap or the current chunk, so this chunk starts at
                        # least this far after the previous start.
                        minimum_index_offset = max(
                            0,
                            previous_chunk_len - self._chunk_overlap,
                            previous_chunk_len - len(chunk),
                        )
                    else:
                        minimum_index_offset = 1
                    index = text.find(chunk, index + minimum_index_offset)
                    metadata["start_index"] = index
                    previous_chunk_len = len(chunk)
                new_doc = Document(page_content=chunk, metadata=metadata)
                documents.append(new_doc)
        return documents
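
For anyone who wants to check the index arithmetic without installing LangChain, here is a minimal standalone sketch of the same search rule. `start_indices_patched` is a hypothetical helper name, and the chunk list is taken from the outputs reported in this issue rather than produced by a real splitter.

```python
from typing import List


def start_indices_patched(text: str, chunks: List[str], chunk_overlap: int) -> List[int]:
    """Standalone copy of the index search used in the proposed change."""
    indices = []
    index = -1  # so the first search starts at position 0
    previous_chunk_len = 0
    for j, chunk in enumerate(chunks):
        if j > 0:
            # Smallest distance this chunk can start after the previous one.
            minimum_index_offset = max(
                0,
                previous_chunk_len - chunk_overlap,
                previous_chunk_len - len(chunk),
            )
        else:
            minimum_index_offset = 1
        index = text.find(chunk, index + minimum_index_offset)
        indices.append(index)
        previous_chunk_len = len(chunk)
    return indices


# Chunks as reported for chunk_size=5, chunk_overlap=5.
patched = start_indices_patched(
    "chunk chunk", ["chunk", "chun", "chunk"], chunk_overlap=5
)
print(patched)  # [0, 6, 6]
```

In this example the `previous_chunk_len - len(chunk)` term is what moves the second search to position 1, past the first occurrence of "chun".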

maggonravi added a commit to maggonravi/langchain that referenced this issue May 9, 2024
Sample code:

    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.docstore.document import Document
    splitter = RecursiveCharacterTextSplitter(chunk_size=5, chunk_overlap=5, separators=[" ", ""], add_start_index=True)
    splitter.split_documents([Document(page_content="chunk chunk")])

    Before this commit:

    [Document(page_content='chunk', metadata={'start_index': 0}),
     Document(page_content='chun', metadata={'start_index': 0}),
     Document(page_content='chunk', metadata={'start_index': 0})]

    After this commit:

    [Document(page_content='chunk', metadata={'start_index': 0}),
     Document(page_content='chun', metadata={'start_index': 6}),
     Document(page_content='chunk', metadata={'start_index': 6})]

    This resolves langchain-ai#21475