Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Send document.metadata to AzureAISearch in addVectors() #5274

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

keiohtani
Copy link

Fixes the issue with document.metadata not being sent to Azure AI Search when using AzureAISearchVectorStore.fromDocuments().
AzureAISearchVectorStore.fromDocuments() accepts Document[] as argument which has a parameter document.metadata, but internally this value is not passed to AzureAISearchVectorStore.addVectors() and it could be confusing to the users.

Copy link

vercel bot commented May 2, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
langchainjs-api-refs ✅ Ready (Inspect) Visit Preview 💬 Add feedback May 2, 2024 9:38pm
langchainjs-docs ✅ Ready (Inspect) Visit Preview May 2, 2024 9:38pm

@dosubot dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. auto:bug Related to a bug, vulnerability, unexpected error with an existing feature labels May 2, 2024
@@ -314,6 +314,7 @@ export class AzureAISearchVectorStore extends VectorStore {
metadata: {
source: doc.metadata?.source,
attributes: doc.metadata?.attributes ?? [],
...doc.metadata,
Copy link
Contributor

@sinedied sinedied May 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't work and most probably triggers an error.
AFAIK Azure AI Search only supports a fixed schema (see index creation), hence the "workaround" of setting metadata as K/V pair in the generic attributes property.

This is something I plan to fix in a later PR by doing the K/V pair transform automatically, and allowing to customize the index schema.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that’s correct. I think we can update the LangChain JS doc to provide instructions on what to do in that case. Actually to use the AzureAISearchVectorStore class, I had to create source and attributes fields in the metadata field anyways, otherwise the fromDocuments() method was failing. I am happy to make those changes in the doc as well if you are ok with it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't provide source or attributes it should work anyways as there's a fallback, do you remember what kind of error you had? Metadata should not be mandatory, if there's an issue with empty metadata I'll be happy to fix it

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure,

This is the error message when a metadata field is defined in the index,

}
Error: Azure AI Search uploadDocuments batch failed: RestError: The request is invalid. Details: Cannot find nested property 'metadata' on the resource type 'search.documentFields'.
    at EventEmitter.<anonymous> (/home/keiohtani/projects/chatifc-backend/node_modules/@langchain/community/dist/vectorstores/azure_aisearch.cjs:225:19)
    at EventEmitter.emit (node:events:514:28)
    at SearchIndexingBufferedSender.submitDocuments (/home/keiohtani/projects/chatifc-backend/node_modules/@azure/search-documents/src/searchIndexingBufferedSender.ts:463:22)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at SearchIndexingBufferedSender.internalFlush (/home/keiohtani/projects/chatifc-backend/node_modules/@azure/search-documents/src/searchIndexingBufferedSender.ts:400:9)
    at AzureAISearchVectorStore.addVectors (/home/keiohtani/projects/chatifc-backend/node_modules/@langchain/community/dist/vectorstores/azure_aisearch.cjs:228:9)
    at AzureAISearchVectorStore.addDocuments (/home/keiohtani/projects/chatifc-backend/node_modules/@langchain/community/dist/vectorstores/azure_aisearch.cjs:201:25)
    at Function.fromDocuments (/home/keiohtani/projects/chatifc-backend/node_modules/@langchain/community/dist/vectorstores/azure_aisearch.cjs:576:9)
    at async Promise.all (index 0)
    at SummarizationChainService.createEmbeddings (/home/keiohtani/projects/chatifc-backend/src/services/LangChain/SummarizationChainService.ts:94:23)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And Azure AI Search does not allow an empty metadata field when creating from LangChain (in this case, we are using Python)
image

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, looking at the other method, fromTexts, The way metadatas is handled is confusing.

static async fromTexts(
    texts: string[],
    metadatas: object[] | object,
    embeddings: EmbeddingsInterface,
    config: AzureAISearchConfig
  ): Promise<AzureAISearchVectorStore> {
    const docs: Document<AzureAISearchDocumentMetadata>[] = [];
    for (let i = 0; i < texts.length; i += 1) {
      const metadata = Array.isArray(metadatas) ? metadatas[i] : metadatas;
      const newDoc = new Document({
        pageContent: texts[i],
        metadata,
      });
      docs.push(newDoc);
    }
    return AzureAISearchVectorStore.fromDocuments(docs, embeddings, config);
  }

At first glance, it seems this metadatas parameter will be used in the document creation, but because it's overwritten in the addVectors(), that's not the case and it's really hard to understand the behavior by just looking at the method itself.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sinedied Following up on this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, if you need metadata to be stored you need to use these two properties: https://github.com/langchain-ai/langchainjs/blob/main/libs/langchain-community/src/vectorstores/azure_aisearch.ts#L64-L65

It's not ideal I know, that's why I would like to propose a better implementation for this (the current one was based on the Python implementation). I'll probably work on it starting June, with the current plan:

  1. Automatically take any metadata properties passed in and adapt them to fit in the current schema. That would solve your current issue, but it still would allow only string values.
  2. Propose a way to customize the schema and the attributes used for storage.

@bracesproul bracesproul added the hold On hold label May 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto:bug Related to a bug, vulnerability, unexpected error with an existing feature hold On hold size:XS This PR changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants