[Bug] Indexing problem with langchain retrieval augmentation #250

Open
2 tasks done
abitofalchemy opened this issue Aug 7, 2023 · 0 comments
Labels
bug Something isn't working

Is this a new bug?

  • I believe this is a new bug
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

 49%|████████████████████████████████████████▍                                         | 4935/10000 [05:52<02:22, 35.66it/s]
E0807 11:33:40.238867000 140704491832896 ssl_transport_security_utils.cc:105] Corruption detected.
E0807 11:33:40.238921000 140704491832896 ssl_transport_security_utils.cc:61] error:100003fc:SSL routines:OPENSSL_internal:SSLV3_ALERT_BAD_RECORD_MAC
E0807 11:33:40.238930000 140704491832896 secure_endpoint.cc:305]       Decryption error: TSI_DATA_CORRUPTED
 49%|████████████████████████████████████████▍                                         | 4937/10000 [05:54<06:03, 13.95it/s]

This output appears when the following code runs:

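# Imports and setup are not shown in the report; they are assumed here,
# and the batch_limit value in particular is a guess.
from uuid import uuid4
from tqdm.auto import tqdm

batch_limit = 100
texts = []
metadatas = []

for i, record in enumerate(tqdm(data)):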
    # first get metadata fields for this record
    metadata = {
        'wiki-id': str(record['id']),
        'source': record['url'],
        'title': record['title']
    }
    # now we create chunks from the record text
    record_texts = text_splitter.split_text(record['text'])
    # create individual metadata dicts for each chunk
    record_metadatas = [{
        "chunk": j, "text": text, **metadata
    } for j, text in enumerate(record_texts)]
    # append these to current batches
    texts.extend(record_texts)
    metadatas.extend(record_metadatas)
    # if we have reached the batch_limit we can add texts
    if len(texts) >= batch_limit:
        ids = [str(uuid4()) for _ in range(len(texts))]
        embeds = embed.embed_documents(texts)
        index.upsert(vectors=zip(ids, embeds, metadatas))
        texts = []
        metadatas = []
        

This code comes from the reference example at https://www.pinecone.io/learn/series/langchain/langchain-retrieval-augmentation/, which is where I am running into the issue.

I am running Python 3.8 on an Intel macOS machine.

Expected Behavior

The indexing process should iterate through the data we’d like to add to our knowledge base, creating IDs, embeddings, and metadata, and then add these to the index in batches.

This description is from: https://www.pinecone.io/learn/series/langchain/langchain-retrieval-augmentation/
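
The expected batching behavior can be sketched without any external service. This is a minimal illustration, not the tutorial's code: `FakeIndex`, `fake_embed`, `batched`, and `index_chunks` are hypothetical names, and the index is an in-memory stand-in rather than the Pinecone client. Unlike the loop in this report, it cuts batches at exactly `batch_limit` items and also sends the final partial batch.

```python
from itertools import islice
from uuid import uuid4


def batched(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


class FakeIndex:
    """Hypothetical in-memory stand-in for a vector index (not the Pinecone API)."""

    def __init__(self):
        self.vectors = {}

    def upsert(self, vectors):
        for vector_id, values, metadata in vectors:
            self.vectors[vector_id] = (values, metadata)


def fake_embed(texts):
    """Hypothetical embedder: one tiny vector per text, for illustration only."""
    return [[float(len(text))] for text in texts]


def index_chunks(chunks, index, embed_fn, batch_limit=100):
    """Embed and upsert (text, metadata) pairs in batches; return the batch count."""
    batches_sent = 0
    for batch in batched(chunks, batch_limit):
        texts = [text for text, _ in batch]
        metadatas = [metadata for _, metadata in batch]
        ids = [str(uuid4()) for _ in texts]
        embeds = embed_fn(texts)
        # Materialize the triples so the payload is a plain list, not a one-shot iterator.
        index.upsert(vectors=list(zip(ids, embeds, metadatas)))
        batches_sent += 1
    return batches_sent


chunks = [(f"chunk {n}", {"chunk": n}) for n in range(250)]
index = FakeIndex()
sent = index_chunks(chunks, index, fake_embed, batch_limit=100)
print(sent, len(index.vectors))
```

With 250 chunks and a limit of 100, the sketch sends three batches (100, 100, 50), and every chunk ends up in the stand-in index.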

Steps To Reproduce

  1. Activate a conda env using Python 3.8 (to be compatible with tiktoken)
  2. Run the code in a Jupyter notebook
  3. The error occurs when this part of the code runs inside the for-loop:

```
if len(texts) >= batch_limit:
    ids = [str(uuid4()) for _ in range(len(texts))]
    embeds = embed.embed_documents(texts)
    index.upsert(vectors=zip(ids, embeds, metadatas))
    texts = []
    metadatas = []
```

The error is around this, I think ... but I might be wrong:

```
PineconeException                         Traceback (most recent call last)

Cell In[28], line 21
     19 ids = [str(uuid4()) for _ in range(len(texts))]
     20 embeds = embed.embed_documents(texts)
---> 21 index.upsert(vectors=zip(ids, embeds, metadatas))
     22 texts = []
     23 metadatas = []
```
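
Since the failure surfaces at the `index.upsert` call and the log shows a transport-level error, one thing that might be worth trying is wrapping the upsert in a retry. The sketch below is an assumption-laden illustration, not a confirmed fix: `upsert_with_retry` and `FlakyIndex` are hypothetical names, the stand-in index only simulates a failure, and whether a retry actually gets past `TSI_DATA_CORRUPTED` has not been verified. The `zip(...)` passed in the original code is a one-shot iterator, so the helper copies it into a list before the first attempt. Otherwise a retry would resend an empty payload.

```python
import time


def upsert_with_retry(upsert_fn, vectors, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call upsert_fn(vectors=...) and retry with exponential backoff on failure.

    `upsert_fn` is whatever callable performs the write, e.g. `index.upsert`.
    """
    # zip() returns a one-shot iterator; materialize it so a retry resends
    # the full batch instead of an already-consumed, empty iterator.
    payload = list(vectors)
    for attempt in range(1, max_attempts + 1):
        try:
            return upsert_fn(vectors=payload)
        except Exception:  # assumption: narrow this to the client's exception type
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


class FlakyIndex:
    """Hypothetical stand-in that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.stored = []

    def upsert(self, vectors):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("simulated transport error")
        self.stored.extend(vectors)
        return len(vectors)


flaky = FlakyIndex(failures=2)
ids, embeds, metas = ["a", "b"], [[0.1], [0.2]], [{"chunk": 0}, {"chunk": 1}]
delays = []  # record requested delays instead of actually sleeping
count = upsert_with_retry(flaky.upsert, zip(ids, embeds, metas), sleep=delays.append)
print(count, delays)  # 2 [1.0, 2.0]
```

In the loop from this report, the call would become `upsert_with_retry(index.upsert, zip(ids, embeds, metadatas))`, assuming `index.upsert` accepts a list for `vectors`.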

Relevant log output

No response

Environment

- **macOS**: Ventura 13.4.1 (c) (22F770820d)
- **Language version**: Python 3.8 (`langchain 0.0.162`)
- **Pinecone client version**: `pinecone-client 2.2.2`
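
For anyone trying to reproduce this, the versions above can be collected programmatically. This is a small sketch using only the standard library (`importlib.metadata` is available from Python 3.8). The helper names are hypothetical, and including `grpcio` is an assumption based on the gRPC-style log lines in the output.

```python
import platform
import sys
from importlib import metadata


def package_version(name):
    """Return the installed version of a distribution, or a placeholder if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def environment_report(packages):
    """Build a plain-text report of the interpreter, platform, and package versions."""
    lines = [
        f"python: {sys.version.split()[0]}",
        f"platform: {platform.platform()}",
    ]
    lines += [f"{name}: {package_version(name)}" for name in packages]
    return "\n".join(lines)


# The package list is illustrative; grpcio is an assumed dependency.
print(environment_report(["langchain", "pinecone-client", "grpcio", "tiktoken"]))
```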

Additional Context

I am doing this while connected to a VPN.

@abitofalchemy abitofalchemy added the bug Something isn't working label Aug 7, 2023