
[Bug]: PapaCSVReader concatRows=true fails for some .csv files #836

Open
reidperyam opened this issue May 14, 2024 · 11 comments · May be fixed by #843
Labels
bug Something isn't working

Comments

reidperyam commented May 14, 2024

Bug Description

The following code fails to generate a vectorStorageIndex from a simple .CSV file (see attached MOCK_DATA.csv ).

import { SimpleDirectoryReader, PapaCSVReader } from "llamaindex";

export const DATA_DIR = "./data";

export async function getDocuments() {
  return await new SimpleDirectoryReader().loadData({
    directoryPath: DATA_DIR,
    fileExtToReader: {
      csv: new PapaCSVReader()
    }
  });
}

Version

0.2.10

Steps to Reproduce

clone the following github repo:

https://github.com/reidperyam/practical-star/tree/master

see README.md for the repro, which I will copy here:

First, populate .env with OPENAI_API_KEY

install the dependencies:

npm install

verify that all contents of the /cache directory are removed!

Second, generate the embeddings of the documents in the ./data directory:

npm run generate

EXPECTED RESULT:

generated cache/doc_store.json
generated cache/index_store.json
generated cache/vector_store.json

ACTUAL RESULT:

Error generating text embeddings (see Relevant Logs/Tracebacks below).

Relevant Logs/Tracebacks

BadRequestError: 400 This model's maximum context length is 8192 tokens, however you requested 19956 tokens (19956 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.
    at Function.generate (C:\Code\canoe\node_modules\openai\src\error.ts:70:14)
    at OpenAI.makeStatusError (C:\Code\canoe\node_modules\openai\src\core.ts:383:21)
    at OpenAI.makeRequest (C:\Code\canoe\node_modules\openai\src\core.ts:446:24)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async OpenAIEmbedding.getOpenAIEmbedding (C:\Code\canoe\node_modules\llamaindex\dist\cjs\embeddings\OpenAIEmbedding.js:100:26)
    at async OpenAIEmbedding.getTextEmbeddings (C:\Code\canoe\node_modules\llamaindex\dist\cjs\embeddings\OpenAIEmbedding.js:111:16)
    at async batchEmbeddings (C:\Code\canoe\node_modules\llamaindex\dist\cjs\embeddings\types.js:61:32)
    at async OpenAIEmbedding.getTextEmbeddingsBatch (C:\Code\canoe\node_modules\llamaindex\dist\cjs\embeddings\types.js:43:16)
    at async OpenAIEmbedding.transform (C:\Code\canoe\node_modules\llamaindex\dist\cjs\embeddings\types.js:47:28)
    at async VectorStoreIndex.getNodeEmbeddingResults (C:\Code\canoe\node_modules\llamaindex\dist\cjs\indices\vectorStore\index.js:487:9)
    at async VectorStoreIndex.insertNodes (C:\Code\canoe\node_modules\llamaindex\dist\cjs\indices\vectorStore\index.js:572:17)
    at async VectorStoreIndex.buildIndexFromNodes (C:\Code\canoe\node_modules\llamaindex\dist\cjs\indices\vectorStore\index.js:497:9)
    at async VectorStoreIndex.init (C:\Code\canoe\node_modules\llamaindex\dist\cjs\indices\vectorStore\index.js:445:13)
    at async VectorStoreIndex.fromDocuments (C:\Code\canoe\node_modules\llamaindex\dist\cjs\indices\vectorStore\index.js:523:16)
@reidperyam reidperyam changed the title [Bug]: VectorStoreIndex.fromDocuments with .CSV file does not create OpenAPI embeddings [Bug]: PapaCSVReader concatRows=true fails for some .csv files May 14, 2024
@reidperyam (Author) commented:

After further investigation this seems to be an issue loading certain .csv files using PapaCSVReader.

For example, the .csv used in the LlamaIndexTS example documentation (titanic_train.csv) loads as expected with the code referenced above.

However, other tested, valid .csv files are not parsed as expected (see the previously attached MOCK_DATA.csv).

There is a workaround to circumvent this and allow other .csv files to be loaded: the default constructor of PapaCSVReader sets concatRows=true; if this is set to false, the document store can be loaded as expected, but this creates a separate document for each line in the .csv.
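For illustration, the effect of concatRows can be sketched in plain TypeScript (a conceptual model, not the actual PapaCSVReader implementation):

```typescript
// Conceptual sketch (NOT the PapaCSVReader source): what the
// concatRows option controls once the CSV has been parsed into rows.
const rows: string[][] = [
  ["id", "name"],
  ["1", "Alice"],
  ["2", "Bob"],
];
const textPerRow = rows.map((row) => row.join(", "));

// concatRows = true: one document containing all rows joined together.
const concatenated: string[] = [textPerRow.join("\n")];

// concatRows = false: one document per row (the workaround above).
const perRow: string[] = textPerRow;

console.log(concatenated.length); // 1
console.log(perRow.length); // 3
```

With concatRows=true the whole file becomes a single text blob, which is why a large CSV can exceed the embedding model's context length in one request.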

@marcusschiesser (Collaborator) commented:

@reidperyam the error says that the text is too long to generate an embedding. My guess is that the CSV parser generates very long sentences. Can you try:

import { Settings, SimpleNodeParser } from "llamaindex";

Settings.nodeParser = new SimpleNodeParser({
  chunkSize: 512,
  chunkOverlap: 20,
  splitLongSentences: true,
});

@reidperyam (Author) commented:

I added this code (screenshots of the changes attached), and the issue remains (error screenshot attached).

I pushed these changes up to the github repro repo @marcusschiesser

@himself65 himself65 added the bug Something isn't working label May 15, 2024
@himself65 (Member) commented:

Sorry, but I cannot reproduce this bug:

~/Code/practical-star git:[master]
npm run generate

> nextjs@0.1.0 generate
> tsx app/generate.ts

Using 'openai' model provider
CHUNK_SIZE 512
CHUNK_OVERLAP 20
EMBEDDING_DIM 1024
Generating generateDatasource...
STORAGE_CACHE_DIR ./cache
Generating serviceContextFromDefaults...
Generating storageContextFromDefaults...
No valid data found at path: cache/index_store.json starting new store.
No valid data found at path: ./cache/vector_store.json starting new store.
getting docutments...
document ac67672e-d880-4808-8c22-97071ee2f947 loaded
Generating VectorStoreIndex.fromDocuments...
Storage context successfully generated in 0.006s.
Finished generating storage.

ghost commented May 15, 2024

@himself65 But does it actually create a vector_store.json and an index_store.json?

I could reproduce the bug with the token error, and it only created a doc_store.json. If you then run it again with a doc_store.json in /cache, it will show the output as successful like in your snippet above, but it's still not doing anything.

@himself65 (Member) commented:

Will check this; I think there are some bugs in the node parser.

@himself65 (Member) commented:

I think we shouldn't modify the sentence splitter, since there is no grammar for a CSV result. So I think it's better to split the CSV into different documents.

@himself65 himself65 linked a pull request May 16, 2024 that will close this issue
ghost commented May 16, 2024

@himself65 Additional information.

If you triple the size of titanic_train.csv by just copy-pasting the content twice, it works as it should and creates an index_store.json and a vector_store.json that contain the same results three times, even though the file has more rows and columns.

I could reproduce the error with multiple mock files from different file generators. But both titanic_train.csv and movie_reviews.csv worked even when increasing their size. So it may be something about the underlying CSV structure?

ghost commented May 16, 2024

@himself65 I isolated the issue down to the defaultregex used in TextSplitter.ts.

If we extend it to include one or more whitespaces instead of just one whitespace, it fixes the issue:

const defaultregex = /[.?!][])'"`’”]*(?:\s+|$|)/g;

I don't really know enough about document structures to judge whether that change would break a lot of stuff or not.
Alternatively, add a customregex param to SentenceSplitter?
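For illustration, a self-contained sketch (plain TypeScript, with the escaping fixed as a later comment notes; not the llamaindex source) of why a sentence-boundary regex like this finds no split points in typical CSV text:

```typescript
// The sentence-boundary regex under discussion (escaping fixed: note
// the backslash before "]" in the character class).
const sentenceEnd = /[.?!][\])'"`’”]*(?:\s+|$)/g;

const prose = "First sentence. Second sentence? Third!";
const csvRow = "1,Alice,alice@example.com,42.5,true";

// Prose has punctuation followed by whitespace (or end of string),
// so the splitter finds several sentence boundaries.
console.log(prose.match(sentenceEnd)?.length); // 3

// A CSV row only has dots inside fields (e.g. "42.5", ".com"), never
// followed by whitespace, so no boundary matches and the whole row is
// treated as one unbreakable "sentence".
console.log(csvRow.match(sentenceEnd)); // null
```

With no boundaries at all, the splitter cannot break a long concatenated CSV text into chunks, which fits the oversized-embedding-request error above.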

ghost commented May 16, 2024

[Screenshot attached: 2024-05-16 174539]

The defaultregex above is missing the \ before the ] inside the character class.

ghost commented May 16, 2024

I think we shouldn't modify the sentence splitter since there is no grammar for a CSV result. So I think it's better to split the CSV to different documents

After further testing, the above "fix" only works with the mock file, but still struggles with other files. So true, there is no proper grammar in CSV, so it's probably quite hard to find the right regex. So either:
-> Trim the CSV to a pre-determined structure?
-> Split the document before generating nodes?
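The "split before generating nodes" option could be sketched like this (plain TypeScript; batchCsvRows and rowsPerDoc are hypothetical names for illustration, not the approach taken in the linked PR):

```typescript
// Hypothetical helper: batch CSV rows into groups of at most
// `rowsPerDoc`, so each group becomes its own document text instead
// of one oversized blob that exceeds the embedding context length.
function batchCsvRows(rows: string[], rowsPerDoc: number): string[] {
  const docs: string[] = [];
  for (let i = 0; i < rows.length; i += rowsPerDoc) {
    docs.push(rows.slice(i, i + rowsPerDoc).join("\n"));
  }
  return docs;
}

const rows = Array.from({ length: 10 }, (_, i) => `row-${i}`);
const docs = batchCsvRows(rows, 4);
console.log(docs.length); // 3 (groups of 4, 4, and 2 rows)
```

This sits between the two extremes already discussed: one giant concatenated document (concatRows=true) and one document per line (concatRows=false).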
