[Bug] Indexing problem with langchain retrieval augmentation #250

abitofalchemy · 2023-08-07T16:13:04Z

Is this a new bug?

I believe this is a new bug
I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

 49%|████████████████████████████████████████▍                                         | 4935/10000 [05:52<02:22, 35.66it/s]E0807 11:33:40.238867000 140704491832896 ssl_transport_security_utils.cc:105] Corruption detected.
E0807 11:33:40.238921000 140704491832896 ssl_transport_security_utils.cc:61] error:100003fc:SSL routines:OPENSSL_internal:SSLV3_ALERT_BAD_RECORD_MAC
E0807 11:33:40.238930000 140704491832896 secure_endpoint.cc:305]       Decryption error: TSI_DATA_CORRUPTED
 49%|████████████████████████████████████████▍                                         | 4937/10000 [05:54<06:03, 13.95it/s]

When the following code runs:

for i, record in enumerate(tqdm(data)):
    # first get metadata fields for this record
    metadata = {
        'wiki-id': str(record['id']),
        'source': record['url'],
        'title': record['title']
    }
    # now we create chunks from the record text
    record_texts = text_splitter.split_text(record['text'])
    # create individual metadata dicts for each chunk
    record_metadatas = [{
        "chunk": j, "text": text, **metadata
    } for j, text in enumerate(record_texts)]
    # append these to current batches
    texts.extend(record_texts)
    metadatas.extend(record_metadatas)
    # if we have reached the batch_limit we can add texts
    if len(texts) >= batch_limit:
        ids = [str(uuid4()) for _ in range(len(texts))]
        embeds = embed.embed_documents(texts)
        index.upsert(vectors=zip(ids, embeds, metadatas))
        texts = []
        metadatas = []

The reference example that runs into this issue is https://www.pinecone.io/learn/series/langchain/langchain-retrieval-augmentation/.

The implementation uses Python 3.8 on an Intel macOS machine.

Expected Behavior

The indexing process should iterate through the data we’d like to add to our knowledge base, creating IDs, embeddings, and metadata, and then add these to the index in batches.

This description is from: https://www.pinecone.io/learn/series/langchain/langchain-retrieval-augmentation/
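To make the expected batching concrete, here is a minimal sketch of what I understand the loop should do. It is not the tutorial's exact code: `index_in_batches`, `embed_fn`, and `upsert_fn` are hypothetical names standing in for the loop body, `embed.embed_documents`, and `index.upsert`. It also flushes per chunk rather than per record, and sends the final partial batch after the loop ends.

```python
from typing import Callable, Dict, Iterable, List, Tuple
from uuid import uuid4

# (id, embedding, metadata), the tuple shape passed to upsert in this issue
Vector = Tuple[str, List[float], Dict]


def index_in_batches(
    chunks: Iterable[Tuple[str, Dict]],
    embed_fn: Callable[[List[str]], List[List[float]]],  # hypothetical stand-in for embed.embed_documents
    upsert_fn: Callable[[List[Vector]], None],  # hypothetical stand-in for index.upsert
    batch_limit: int = 100,
) -> int:
    """Embed and upsert (text, metadata) pairs in batches; return how many were sent."""
    texts: List[str] = []
    metadatas: List[Dict] = []
    sent = 0

    def flush() -> None:
        nonlocal sent, texts, metadatas
        if not texts:
            return
        ids = [str(uuid4()) for _ in texts]
        embeds = embed_fn(texts)
        # Materialize the zip so the batch is a plain list of tuples.
        upsert_fn(list(zip(ids, embeds, metadatas)))
        sent += len(texts)
        texts, metadatas = [], []

    for text, metadata in chunks:
        texts.append(text)
        metadatas.append(metadata)
        if len(texts) >= batch_limit:
            flush()
    flush()  # send whatever is left after the loop
    return sent
```

With 5 chunks and `batch_limit=2`, this sketch would call `upsert_fn` three times, with batches of 2, 2, and 1 vectors.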

Steps To Reproduce

Activate a conda env using Python 3.8 (to be compatible with tiktoken).
Run the code above in a Jupyter notebook.
The error occurs when this part of the code is in the for-loop:

```
if len(texts) >= batch_limit:
    ids = [str(uuid4()) for _ in range(len(texts))]
    embeds = embed.embed_documents(texts)
    index.upsert(vectors=zip(ids, embeds, metadatas))
    texts = []
    metadatas = []
```

The error is around this, I think ... but I might be wrong:

```
PineconeException                         Traceback (most recent call last)

Cell In[28], line 21
     19 ids = [str(uuid4()) for _ in range(len(texts))]
     20 embeds = embed.embed_documents(texts)
---> 21 index.upsert(vectors=zip(ids, embeds, metadatas))
     22 texts = []
     23 metadatas = []
```

Relevant log output

No response

Environment

- **macOS**: Ventura 13.4.1 (c) (22F770820d)
- **Language version**: Python 3.8, `langchain 0.0.162`
- **Pinecone client version**: `pinecone-client 2.2.2`

Additional Context

I am doing this while connected to a VPN.
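If the SSL errors are transient (possibly related to the VPN, though that is only a guess), one workaround I could try is retrying the failed batch. The sketch below is an assumption on my part, not something from the tutorial or the Pinecone client: `upsert_with_retry` and its parameters are hypothetical, and it takes the upsert call as a plain function so it does not depend on any client API.

```python
import time
from typing import Callable, Iterable, List, Tuple


def upsert_with_retry(
    upsert_fn: Callable[[List[Tuple]], object],  # hypothetical wrapper around index.upsert
    vectors: Iterable[Tuple],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call upsert_fn(batch), retrying with exponential backoff; return attempts used."""
    # A zip() iterator is exhausted after one pass, so materialize it
    # before the first attempt; otherwise a retry would send nothing.
    batch = list(vectors)
    for attempt in range(1, max_attempts + 1):
        try:
            upsert_fn(batch)
            return attempt
        except Exception:
            # Broad on purpose for this sketch; in practice this should be
            # narrowed to the exception type shown in the traceback.
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    return max_attempts  # only reached if max_attempts < 1
```

In the loop above, this would replace the direct call, for example `upsert_with_retry(lambda batch: index.upsert(vectors=batch), zip(ids, embeds, metadatas))`. If the retries always fail at the same point, the problem is probably not transient.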

abitofalchemy added the bug (Something isn't working) label Aug 7, 2023
