Manually setting doc_id on Documents results in an error with IngestionPipeline using DynamoDB docstore and LanceDB vector store #12898
If these steps do not help, consider creating a minimal example that reproduces the error and testing it in a different environment to rule out any environment-specific issues.
Hello,
I'm trying to build an ingestion pipeline on documents retrieved from an API where I'm creating the Document object by manually setting the doc_id like so:
These are going into a list of documents, let's just call it documents, where doc is a JSON object and meta_id is a hash of the metadata for that document. I want only one version of the docs to persist in my index for every unique meta_id.

I'm using an IngestionPipeline with a DynamoDB docstore and a LanceDB vector store, which I've set up like this:
This pipeline does a great job of not re-embedding the same documents when no changes are detected in the document for a given meta_id, but when a change is detected in the document (through the document_hash) I get the following error:

OSError: LanceError(IO): Received literal Utf8("0024abc1ff76dded6f223a5cdbcbdb0eca1aaf5181de1598c8d5326e81d199ef") and could not convert to literal of type 'Null', /Users/runner/work/lance/lance/rust/lance/src/datafusion/logical_expr.rs:38:27
where in this example "0024abc1ff76dded6f223a5cdbcbdb0eca1aaf5181de1598c8d5326e81d199ef" is the meta_id with the now outdated document that needs to be reindexed.
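The behaviour expected here can be sketched in plain Python. This is only a simplified model, not the library's implementation: `compute_meta_id`, `content_hash`, and `upsert` are hypothetical helpers, a dict and a list stand in for the docstore and the vector store, and SHA-256 is an assumption (it matches the 64-character hex ID in the error, but nothing above confirms it). The "replaced" branch corresponds to the case where the error appears.

```python
import hashlib
import json


def compute_meta_id(metadata: dict) -> str:
    """Hypothetical helper: stable ID from the metadata (assumes SHA-256)."""
    # Sorting keys makes the ID independent of key order in the JSON object.
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def content_hash(text: str) -> str:
    """Hypothetical stand-in for the document hash used to detect changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def upsert(docstore: dict, vector_rows: list, doc_id: str, text: str) -> str:
    """Simplified upsert: returns 'skipped', 'inserted', or 'replaced'."""
    new_hash = content_hash(text)
    old_hash = docstore.get(doc_id)
    if old_hash == new_hash:
        # Same ID, same content: nothing to re-embed.
        return "skipped"
    action = "inserted" if old_hash is None else "replaced"
    if old_hash is not None:
        # Same ID, changed content: drop the stale rows for this doc_id first.
        vector_rows[:] = [row for row in vector_rows if row["doc_id"] != doc_id]
    vector_rows.append({"doc_id": doc_id, "text": text})
    docstore[doc_id] = new_hash
    return action
```

Calling `upsert` three times with the same `doc_id` and the texts "v1", "v1", "v2" should insert, skip, and then replace, leaving a single row for that ID.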
Any guidance on this?
Thanks!