Bug: incorrect value of start_index in RecursiveCharacterTextSplitter when substring is present #21475

maggonravi · 2024-05-09T10:29:27Z

Checked other resources

I added a very descriptive title to this issue.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.
I am sure that this is a bug in LangChain rather than my code.
The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
splitter = RecursiveCharacterTextSplitter(chunk_size=5, chunk_overlap=5, separators=[" ", ""], add_start_index=True)
splitter.split_documents([Document(page_content="chunk chunk")])

Error Message and Stack Trace (if applicable)

No response

Description

Expected output

[Document(page_content='chunk', metadata={'start_index': 0}),
 Document(page_content='chun', metadata={'start_index': 6}),
 Document(page_content='chunk', metadata={'start_index': 6})]

Output with current code

[Document(page_content='chunk', metadata={'start_index': 0}),
 Document(page_content='chun', metadata={'start_index': 0}),
 Document(page_content='chunk', metadata={'start_index': 0})]
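
One plausible reading of the output above, sketched without LangChain: assume the start index is located with `str.find`, beginning the search at the previous index plus `previous_chunk_len - chunk_overlap`. That search rule is an assumption about the current implementation, not a quote from it, and the helper name `naive_start_indices` is hypothetical. When `chunk_overlap` equals `chunk_size`, the search start never advances, so a chunk that also occurs at position 0 is matched there again.

```python
from typing import List


def naive_start_indices(text: str, chunks: List[str], chunk_overlap: int) -> List[int]:
    """Locate each chunk with str.find, advancing by previous length minus overlap.

    Assumed model of the pre-fix lookup; not copied from LangChain.
    """
    indices = []
    index = 0
    previous_chunk_len = 0
    for chunk in chunks:
        # With chunk_overlap >= previous_chunk_len the offset does not move forward.
        offset = index + previous_chunk_len - chunk_overlap
        index = text.find(chunk, max(0, offset))
        indices.append(index)
        previous_chunk_len = len(chunk)
    return indices


# Chunks taken from the output reported in this issue.
print(naive_start_indices("chunk chunk", ["chunk", "chun", "chunk"], chunk_overlap=5))
# prints [0, 0, 0]
```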

System Info

System Information
------------------
> OS:  Linux
> OS Version:  #1 SMP Thu Feb 1 03:51:05 EST 2024
> Python Version:  3.11.8 (main, Mar 15 2024, 12:37:54) [GCC 10.3.1 20210422 (Red Hat 10.3.1-1)]

Package Information
-------------------
> langchain_core: 0.1.46
> langchain: 0.1.12
> langchain_community: 0.0.28
> langsmith: 0.0.82
> langchain_experimental: 0.0.47
> langchain_text_splitters: 0.0.1
> langchainplus_sdk: 0.0.21


maggonravi · 2024-05-09T10:31:08Z

The following change to TextSplitter.create_documents produces the expected output:

class TextSplitter(BaseDocumentTransformer, ABC):
    def create_documents(
        self, texts: List[str], metadatas: Optional[List[dict]] = None
    ) -> List[Document]:
        """Create documents from a list of texts."""
        _metadatas = metadatas or [{}] * len(texts)
        documents = []
        for i, text in enumerate(texts):
            # Start index of the previous chunk; -1 means no chunk located yet.
            index = -1
            previous_chunk_len = 0
            for j, chunk in enumerate(self.split_text(text)):
                metadata = copy.deepcopy(_metadatas[i])
                if self._add_start_index:
                    if j > 0:
                        # Lower bound on how far past the previous chunk's
                        # start the search for this chunk should begin.
                        minimum_index_offset = max(
                            0,
                            previous_chunk_len - self._chunk_overlap,
                            previous_chunk_len - len(chunk),
                        )
                    else:
                        # First chunk: index is -1, so the search starts at 0.
                        minimum_index_offset = 1
                    index = text.find(chunk, index + minimum_index_offset)
                    metadata["start_index"] = index
                    previous_chunk_len = len(chunk)
                new_doc = Document(page_content=chunk, metadata=metadata)
                documents.append(new_doc)
        return documents
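
To check the offset rule in isolation, here is a minimal sketch that applies the same search logic to a plain list of chunks, with no LangChain dependency. The function name `fixed_start_indices` is hypothetical, and the chunk list is taken from the output reported in this issue rather than produced by the splitter.

```python
from typing import List


def fixed_start_indices(text: str, chunks: List[str], chunk_overlap: int) -> List[int]:
    """Mirror the start_index search from the proposed create_documents change."""
    indices = []
    index = -1  # no chunk located yet
    previous_chunk_len = 0
    for j, chunk in enumerate(chunks):
        if j > 0:
            minimum_index_offset = max(
                0,
                previous_chunk_len - chunk_overlap,
                previous_chunk_len - len(chunk),
            )
        else:
            # index is -1 here, so the first search starts at position 0.
            minimum_index_offset = 1
        index = text.find(chunk, index + minimum_index_offset)
        indices.append(index)
        previous_chunk_len = len(chunk)
    return indices


print(fixed_start_indices("chunk chunk", ["chunk", "chun", "chunk"], chunk_overlap=5))
# prints [0, 6, 6]
```

The third term in `max(...)` is what moves the search past position 0 for the shorter chunk 'chun': the previous chunk is one character longer, so the search starts at index 1 and the first match is at 6.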

Sample code:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
splitter = RecursiveCharacterTextSplitter(chunk_size=5, chunk_overlap=5, separators=[" ", ""], add_start_index=True)
splitter.split_documents([Document(page_content="chunk chunk")])

Before this commit:

[Document(page_content='chunk', metadata={'start_index': 0}),
 Document(page_content='chun', metadata={'start_index': 0}),
 Document(page_content='chunk', metadata={'start_index': 0})]

After this commit:

[Document(page_content='chunk', metadata={'start_index': 6}),
 Document(page_content='chun', metadata={'start_index': 6}),
 Document(page_content='chunk', metadata={'start_index': 6})]

This resolves langchain-ai#21475

dosubot bot added the labels "Ɑ: text splitters" (Related to text splitters package) and "🤖:bug" (Related to a bug, vulnerability, unexpected error with an existing feature) on May 9, 2024

maggonravi mentioned this issue May 9, 2024

text-splitters: bug fix for incorrect start_index if the chunk is substring of another chunk #21477

Open
