
[Bug]: PapaCSVReader concatRows=true fails for some .csv files #836

Open
reidperyam opened this issue May 14, 2024 · 11 comments · May be fixed by #843
Labels
bug Something isn't working

Comments

reidperyam commented May 14, 2024

Bug Description

The following code fails to generate a vectorStorageIndex from a simple .CSV file (see attached MOCK_DATA.csv ).

import { SimpleDirectoryReader, PapaCSVReader } from "llamaindex";

export const DATA_DIR = "./data";

export async function getDocuments() {
  return await new SimpleDirectoryReader().loadData({
    directoryPath: DATA_DIR,
    fileExtToReader: {
      csv: new PapaCSVReader()
    }
  });
}

Version

0.2.10

Steps to Reproduce

clone the following github repo:

https://github.com/reidperyam/practical-star/tree/master

see README.md for the repro, which I will copy here:

First, populate .env with OPENAI_API_KEY

install the dependencies:

npm install

verify that all contents of the /cache directory are removed!

Second, generate the embeddings of the documents in the ./data directory:

npm run generate

EXPECTED RESULT:

generated cache/doc_store.json
generated cache/index_store.json
generated cache/vector_store.json

ACTUAL RESULT:

Error generating text embeddings (see Relevant Logs/Tracebacks below).

Relevant Logs/Tracebacks

BadRequestError: 400 This model's maximum context length is 8192 tokens, however you requested 19956 tokens (19956 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.
    at Function.generate (C:\Code\canoe\node_modules\openai\src\error.ts:70:14)
    at OpenAI.makeStatusError (C:\Code\canoe\node_modules\openai\src\core.ts:383:21)
    at OpenAI.makeRequest (C:\Code\canoe\node_modules\openai\src\core.ts:446:24)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async OpenAIEmbedding.getOpenAIEmbedding (C:\Code\canoe\node_modules\llamaindex\dist\cjs\embeddings\OpenAIEmbedding.js:100:26)
    at async OpenAIEmbedding.getTextEmbeddings (C:\Code\canoe\node_modules\llamaindex\dist\cjs\embeddings\OpenAIEmbedding.js:111:16)
    at async batchEmbeddings (C:\Code\canoe\node_modules\llamaindex\dist\cjs\embeddings\types.js:61:32)
    at async OpenAIEmbedding.getTextEmbeddingsBatch (C:\Code\canoe\node_modules\llamaindex\dist\cjs\embeddings\types.js:43:16)
    at async OpenAIEmbedding.transform (C:\Code\canoe\node_modules\llamaindex\dist\cjs\embeddings\types.js:47:28)
    at async VectorStoreIndex.getNodeEmbeddingResults (C:\Code\canoe\node_modules\llamaindex\dist\cjs\indices\vectorStore\index.js:487:9)
    at async VectorStoreIndex.insertNodes (C:\Code\canoe\node_modules\llamaindex\dist\cjs\indices\vectorStore\index.js:572:17)
    at async VectorStoreIndex.buildIndexFromNodes (C:\Code\canoe\node_modules\llamaindex\dist\cjs\indices\vectorStore\index.js:497:9)
    at async VectorStoreIndex.init (C:\Code\canoe\node_modules\llamaindex\dist\cjs\indices\vectorStore\index.js:445:13)
    at async VectorStoreIndex.fromDocuments (C:\Code\canoe\node_modules\llamaindex\dist\cjs\indices\vectorStore\index.js:523:16)
@reidperyam reidperyam changed the title [Bug]: VectorStoreIndex.fromDocuments with .CSV file does not create OpenAPI embeddings [Bug]: PapaCSVReader concatRows=true fails for some .csv files May 14, 2024
@reidperyam (Author) commented:

After further investigation this seems to be an issue loading certain .csv files using PapaCSVReader.

For example, the .csv used in the LlamaIndexTS example documentation (titanic_train.csv) loads as expected with the code referenced above.

However, other tested, valid .csv files are not parsed as expected (see the previously attached MOCK_DATA.csv).

There is a workaround to circumvent this and allow other .csv files to be loaded: the default constructor of PapaCSVReader sets concatRows=true; if this is set to false, the document store can be loaded as expected, but this creates a separate document for each line in the .csv.
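For illustration, the effect of concatRows can be sketched in plain TypeScript (a conceptual model, not the actual PapaCSVReader implementation):

```typescript
// Conceptual sketch (NOT the PapaCSVReader source): what the
// concatRows option controls once the CSV has been parsed into rows.
const rows: string[][] = [
  ["id", "name"],
  ["1", "Alice"],
  ["2", "Bob"],
];
const textPerRow = rows.map((row) => row.join(", "));

// concatRows = true: one document containing all rows joined together.
const concatenated: string[] = [textPerRow.join("\n")];

// concatRows = false: one document per row (the workaround above).
const perRow: string[] = textPerRow;

console.log(concatenated.length); // 1
console.log(perRow.length); // 3
```

With concatRows=true the whole file becomes a single text blob, which is why a large CSV can exceed the embedding model's context length in one request.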

@marcusschiesser (Collaborator) commented:

@reidperyam the error says that the text is too long to generate an embedding. My guess is that the CSV parser generates very long sentences. Can you try:

import { Settings, SimpleNodeParser } from "llamaindex";

Settings.nodeParser = new SimpleNodeParser({
  chunkSize: 512,
  chunkOverlap: 20,
  splitLongSentences: true,
});

@reidperyam (Author) commented:

I added this code (screenshots of the changes attached), and the issue remains (error screenshot attached).

I pushed these changes up to the github repro repo @marcusschiesser

@himself65 himself65 added the bug Something isn't working label May 15, 2024
@himself65 (Member) commented:

Sorry, but I cannot reproduce this bug:

~/Code/practical-star git:[master]
npm run generate

> nextjs@0.1.0 generate
> tsx app/generate.ts

Using 'openai' model provider
CHUNK_SIZE 512
CHUNK_OVERLAP 20
EMBEDDING_DIM 1024
Generating generateDatasource...
STORAGE_CACHE_DIR ./cache
Generating serviceContextFromDefaults...
Generating storageContextFromDefaults...
No valid data found at path: cache/index_store.json starting new store.
No valid data found at path: ./cache/vector_store.json starting new store.
getting docutments...
document ac67672e-d880-4808-8c22-97071ee2f947 loaded
Generating VectorStoreIndex.fromDocuments...
Storage context successfully generated in 0.006s.
Finished generating storage.

ghost commented May 15, 2024

@himself65 But does it actually create a vector_store.json and an index_store.json?

I could reproduce the bug with the token error, and it only created a doc_store.json. If you then run it again with a doc_store.json in /cache, it will show the output as successful like in your snippet above, but it's still not doing anything.

@himself65 (Member) commented:

Will check this; I think there are some bugs in the node parser.

@himself65 (Member) commented:

I think we shouldn't modify the sentence splitter, since there is no grammar for a CSV result. So I think it's better to split the CSV into different documents.

@himself65 himself65 linked a pull request May 16, 2024 that will close this issue
ghost commented May 16, 2024

@himself65 Additional information.

If you triple the size of titanic_train.csv by just copy-pasting the content twice, it works as it should and creates an index_store.json and a vector_store.json that contain the same results three times, even though the file has more rows and columns.

I could reproduce the error with multiple mock files from different file generators. But both titanic_train.csv and movie_reviews.csv worked even when increasing their size. So it may be something about the underlying CSV structure?

ghost commented May 16, 2024

@himself65 I isolated the issue down to the defaultregex used in TextSplitter.ts.

If we extend it to include one or more whitespaces instead of just one whitespace, it fixes the issue:

const defaultregex = /[.?!][])'"`’”]*(?:\s+|$|)/g;

I don't really know enough about document structures to judge whether that change would break a lot of stuff or not.
Alternatively, add a customregex param to SentenceSplitter?
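For illustration, a self-contained sketch (plain TypeScript, with the escaping fixed as a later comment notes; not the llamaindex source) of why a sentence-boundary regex like this finds no split points in typical CSV text:

```typescript
// The sentence-boundary regex under discussion (escaping fixed: note
// the backslash before "]" in the character class).
const sentenceEnd = /[.?!][\])'"`’”]*(?:\s+|$)/g;

const prose = "First sentence. Second sentence? Third!";
const csvRow = "1,Alice,alice@example.com,42.5,true";

// Prose has punctuation followed by whitespace (or end of string),
// so the splitter finds several sentence boundaries.
console.log(prose.match(sentenceEnd)?.length); // 3

// A CSV row only has dots inside fields (e.g. "42.5", ".com"), never
// followed by whitespace, so no boundary matches and the whole row is
// treated as one unbreakable "sentence".
console.log(csvRow.match(sentenceEnd)); // null
```

With no boundaries at all, the splitter cannot break a long concatenated CSV text into chunks, which fits the oversized-embedding-request error above.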

ghost commented May 16, 2024

[Screenshot attached: 2024-05-16 174539]

The defaultregex above is missing the \ before the ] inside the character class.

ghost commented May 16, 2024

I think we shouldn't modify the sentence splitter since there is no grammar for a CSV result. So I think it's better to split the CSV to different documents

After further testing, the above "fix" only works with the mock file, but still struggles with other files. So true, there is no proper grammar in CSV, so it's probably quite hard to find the right regex. So either:
-> Trim the CSV to a pre-determined structure?
-> Split the document before generating nodes?
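The "split before generating nodes" option could be sketched like this (plain TypeScript; batchCsvRows and rowsPerDoc are hypothetical names for illustration, not the approach taken in the linked PR):

```typescript
// Hypothetical helper: batch CSV rows into groups of at most
// `rowsPerDoc`, so each group becomes its own document text instead
// of one oversized blob that exceeds the embedding context length.
function batchCsvRows(rows: string[], rowsPerDoc: number): string[] {
  const docs: string[] = [];
  for (let i = 0; i < rows.length; i += rowsPerDoc) {
    docs.push(rows.slice(i, i + rowsPerDoc).join("\n"));
  }
  return docs;
}

const rows = Array.from({ length: 10 }, (_, i) => `row-${i}`);
const docs = batchCsvRows(rows, 4);
console.log(docs.length); // 3 (groups of 4, 4, and 2 rows)
```

This sits between the two extremes already discussed: one giant concatenated document (concatRows=true) and one document per line (concatRows=false).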
